This is a website for a book that serves as official documentation for OmicSHIELD. On it you will find introductory references to learn about DataSHIELD and “resources,” explanation on the type of analysis that can be performed using OmicSHIELD and workflows (with reproducible code) of the main functionalities of OmicSHIELD. Use cases in which OmicSHIELD is applied to real datasets are presented for illustrating the capabilities of the software for omic analyses (GWAS, transcriptomics and EWAS).
This material is intended to be a quick reference guide for new researchers interested in this technology as well as to be an online companion for the manuscript “Privacy-protected federated omic data analysis in multi-center studies with OmicSHIELD.”
All the functionalities described on this book have been developed at the Bioinformatic Research Group in Epidemiology (BRGE) of ISGlobal with external help from Yannick Marcon (Obiba); and are part of OmicSHIELD.
This website is free to use, and is licensed under a MIT license.
Along this book, there are some details regarding DataSHIELD and “resources” that are not explained in detail, it is expected that the reader is familiar with them. If that is not the case, there are other free online books/papers with that knowledge.
DataSHIELD paper: Description of what is DataSHIELD.
DataSHIELD wiki: Materials about DataSHIELD including:
resource book: In this book you will find information about:
We will be interacting with DataSHIELD through a data warehouse called Opal. This is the server that will handle the authentication of our credentials, storage of data and “resources” and will provide an R server where the non-disclosive analysis will be conducted. Information about it can also be foun online:
It is quite important to have a solid understanding of what are the “resources” and how we work with them, since in all the use cases we are interacting with them to load the Omic data on the R sessions. For that reason we included a very brief description of them without using technicalities.
The “resources” can be imagined as a data structure that contains the information about where to find a data set and the access credentials to it; we as DataSHIELD users are not able to look at this information (it is privately stored on the Opal server), but we can load it into our remote R session to make use of it. Following that, the next step comes naturally.
Once we have in an R session the information to access a dataset (an ExpressionSet for example) we have to actually retrieve it on the remote R session to analyze it. This step is called resolving the resource.
Those two steps can be identified on the code we provide as the following:
Loading the information of a “resource”:
DSI::datashield.assign.resource(conns, "resource", "resource.path.in.opal.server")
Resolving the “resource”:
DSI::datashield.assign.expr(conns, "resource.resolved", expr = as.symbol("as.resource.object(resource)"))
This toy code would first load the “resource” on a variable called resource and it would retrieve the information it contains and assign it to a variable called resource.resolved.
The functionalities of OmicSHIELD are built on top of the “resources” to work with different types of data objects, more precisely we have developed capabilities to work with the following R objects:
These objects are analyzed using BioConductor packages as well as custom-made functions. This ensures that researchers familiar with the BioConductor universe will feel at home when using OmicSHIELD.
Not only we can work using a BioConductor approach, we also developed functionalities to make use of command line tools that are traditionally used on omics analysis, those are:
This allow the researchers to perform analysis on federated data using their own command line based pipelines. Again this ensures that people familiar with those tools will be able to perform analysis easily.
Along this bookdown there are reproducible examples that make use of two different Opal servers. Information about the technology and resources about setting up Opal servers on your institution can be found on the following links 1, 2.
Information about the used Opal servers:
| Opal 1 | Opal 2 | |
|---|---|---|
| URL | https://opal-demo.obiba.org/ | https://datashield.isglobal.org/repo |
| Host | Obiba | ISGlobal |
| Cores | 12 | 72 |
| RAM | 18 GB | 218 GB |
| Details |
|
|
The Figure 2.1 describes the different types of ’omic association analyses that can be performed using DataSHIELD client functions implemented in the dsOmicsClient package. Basically, data (’omic and phenotypes/covariates) can be stored in different sites (http, ssh, AWS S3, local, …) and are managed with Opal through the resourcer package and their extensions implemented in dsOmics.
Figure 2.1: Non-disclosive omic data analysis with DataSHIELD and Bioconductor. The figure illustrates how the resourcer package is used to get access to omic data through the Opal servers. Then DataSHIELD is used in the client side to perform non-disclosive data analyses.
Then, dsOmicsClient package allows different types of analyses: pooled and meta-analysis. Both methods are based on fitting different Generalized Linear Models (GLMs) for each feature when assesing association between ’omic data and the phenotype/trait/condition of interest. Of course, non-disclosive ’omic data analysis from a single study can also be performed.
The pooled approach (Figure 2.2) is recommended when the user wants to analyze ’omic data from different sources and obtain results as if the data were located in a single computer. It should be noted that this can be very time consuming when analyzing multiple features since it calls a base function in DataSHIELD (ds.glm) repeatedly. It also cannot be recommended when data are not properly harmonized (e.g. gene expression normalized using different methods, GWAS data having different platforms, …). Furthermore when it is necesary to remove unwanted variability (for transcriptomic and epigenomica analysis) or control for population stratification (for GWAS analysis), this approach cannot be used since we need to develop methods to compute surrogate variables (to remove unwanted variability) or PCAs (to to address population stratification) in a non-disclosive way.
The meta-analysis approach Figure 2.3 overcomes the limitations raised when performing pooled analyses. First, the computation issue is addressed by using scalable and fast methods to perform data analysis at whole-genome level at each location The transcriptomic and epigenomic data analyses make use of the widely used limma package that uses ExpressionSet or RangedSummarizedExperiment Bioc infrastructures to deal with ’omic and phenotypic (e.g covariates). The genomic data are analyzed using GWASTools and GENESIS that are designed to perform quality control (QC) and GWAS using GDS infrastructure.
Next, we describe how both approaches are implemented:
ds.glm() function which is a DataSHIELD function that uses an approach that is mathematically equivalent to placing all individual-level data froma all sources in one central warehouse and analysing those data using the conventional glm() function in R. The user can select one (or multiple) features (i.e., genes, transcripts, CpGs, SNPs, …)
Figure 2.2: Non-disclosive omic data analysis with DataSHIELD and Bioconductor. The figure illustrates how to perform single pooled omic data analysis. The analyses are performed by using a generalized linear model (glm) on data from one or multiple sources. It makes use of ds.glm(), a DataSHIELD function, that uses an approach that is mathematically equivalent to placing all individual-level data from all sources in one central warehouse and analysing those data using the conventional glm() function in R.
Figure 2.3: Non-disclosive omic data analysis with DataSHIELD and Bioconductor. The figure illustrates how to perform anlyses at genome-wide level from one or multiple sources. It runs standard Bioconductor functions at each server independently to speed up the analyses and in the case of having multiple sources, results can be meta-analyzed uning standar R functions.
Here, we will describe the data employed for the analyses presented in the bookdown, giving details about experimental designs, the type of features available, and the structure and organization of data in the Opal infrastructure. Employed data constitute both public data downloaded from public repositories, and real datasets deriving from an European H2020 research project on exposome and health (HELIX Project). The public datasets have been uploaded to the Opal demo site and can be easily accessed without additional permissions. On the other hand, the HELIX dataset is available in the Opal BRGE site hosted by the Bioinformatic Research Group in Epidemiology of ISGlobal. In both cases, data are organized in the Opal simulating a single-site DataSHIELD infrastructure. Details for accessing the server will be found in each section, at the beginning of each pipeline.
Data from the public CINECA project were employed in this bookdown for demonstrating the capabilities of dsOmicSHIELD for performing GWAS and other genetic analyses. Data correspond to synthetic data set generated by the CINECA Project made freely available under the Creative Commons Licence (CC-BY) (funding: EC H2020 grant 825775). In Figure 3.1, an overall description of the datasets and cohorts belonging to the CINECA project is presented.
Figure 3.1: The figure depicts CINECA project cohorts representing >1.4 Million individuals and spanning three continents.
Data were initially accessible from a public EGA repository (ID: EGAD00001006673), being a deliverable of the VEIS project. Previous to analysis, data were pruned, and the individuals were separated into three synthetic cohorts. Information on individuals and SNP count of each cohort can be found on the Table 3.1.
| Cohort 1 | Cohort 2 | Cohort 3 | Total | |
|---|---|---|---|---|
| Number of SNPs | 865,240 | 865,240 | 865,240 | 865,240 |
| Number of individuals | 817 | 1,073 | 614 | 2,504 |
The three datasets were uploaded to the Opal demo site organized as (1) an array of resource objects, corresponding to unitary VCF files with variant information (one per chromosome and cohort), and (2) table objects corresponding to the phenotype data (one per cohort). Data are in Opal simulating a single-site DataSHIELD architecture, as in Figure 3.2. In this figure, the data analyst corresponds to the “RStudio” session, which through the DataSHIELD Interface (DSI) connects with the Opal server located at the cohort network.
Figure 3.2: Proposed infrastructure to perform single-cohort GWAS studies.
For more information about DataSHIELD data formats, please, refers to the bookdown “Orchestrating privacy-protected non-disclosive big data analyses of data from different resources with R and DataSHIELD”.
The public TCGA Liver project was selected for illustrating how to perform transcriptomic data analysis in DataSHIELD. These data derived from the famous TCGA project. We have uploaded to the demo Opal server a resource called tcga_liver whose URL is http://duffel.rail.bio/recount/TCGA/rse_gene_liver.Rdata available through the recount project.
Some summaries of the dataset are the following:
| TCGA Liver data | |
|---|---|
| Number of individuals | 424 |
| Number of genes | 58,037 |
| Number of covariate fields | 864 |
The RangedSummarizedExperiment (RSE) objects are part of the BioConductor project, as a brief summary, these types of objects are designed to contain different multi-omic data, they can also contain the individuals phenotypes and additional metadata associated with a particular experiment. For further information about what are RangedSummarizedExperiment please take a look at the official documentation
The resource used on this project contains a RangedSummarizedExperiment with the RNAseq profiling of liver cancer data from TCGA. The resource is located inside the OMICS project. The structure used is illustrated on the following figure.
(#fig:dgeProposal_1)Proposed infrastructure to perform DGE studies.
The data analyst corresponds to the “RStudio” session, which through DataSHIELD Interface (DSI) connects with the Opal server. This Opal server contains a resource that corresponds to the TCGA liver RSE.
The public data from the GSE66351 project were used in this section for illustrating how to perform EWAS in DataSHIELD. They has been downloaded from GEO (accession number GSE66351) and consist on DNA methylation reads from the Illumina 450K array. Data corresponds to CpGs beta values measured in the superior temporal gyrus and prefrontal cortex brain regions of patients with Alzheimer’s.
This kind of data is encapsulated on a type of R object called ExpressionSet, this objects are part of the BioConductor project and are meant to contain different sources of genomic data, alongside the genomic data they can also contain the phenotypes and metadata associated to a study. Researchers who are not familiar with ExpressionSet can find further information here. Notice that genomic data is encoded as beta-values that ensure data harmonization across studies.
In order to illustrate how to perform data analyses using federated data, we have split the data into two synthetic cohorts (split by individuals). We have created two resources on the demo Opal server called GSE66351_1 and GSE66351_2 respectively. They can be found inside the OMICS project. The structure used is illustrated on the following figure.
(#fig:ewasProposal_1)Proposed infrastructure to perform EWAS studies.
Some summaries of the datasets are the following:
| Cohort 1 | Cohort 2 | Total | |
|---|---|---|---|
| Number of CpGs | 481,868 | 481,868 | 481,868 |
| Number of individuals | 100 | 90 | 190 |
| Number of covariates | 49 | 49 | 49 |
| Number of annotation features | 37 | 37 | 37 |
The real-world research problem presented in this book comes from the HELIX Project. The HELIX (Human Early-Life Exposome) project gathers data from 6 longitudinal-based European birth cohorts, with the aim of evaluating the effect of environmental risk factors on mothers and children health, with a special focus in the effects on molecular health profiles (omic data). HELIX cohorts include the BIB (Born in Bradford) (United Kingdom), EDEN (Étude des Déterminants pré et postnatals du développement et de la santé de l’ENfant) (France), INMA (INfancia y Medio Ambiente) (Spain), KANC (Kaunus Cohort) (Lithuania), MoBa (Norwegian Mother and Child Cohort Study) (Norway), and Rhea (Mother-Child Cohort in Crete) (Greece). General details of the study design of HELIX can be found in the Figure 3.3. The whole HELIX dataset includes a total of 31,472 mother-child pairs. Among them, a subcohort of 1,301 children (approximately 200 children in each cohort) was selected here according to the following criteria of eligibility: 1) age 6 to 11 years; 2) omic data available (including gene expression and EWAS data); 3) complete address history; and 4) no serious health problems that may affect the clinical testing or the child safety.
Figure 3.3: General overview of the HELIX research project.
In the 1,301 children, a wide range of environmental exposures were evaluated to define the early-life exposome during two time periods: the prenatal pregnancy period and childhood (age 6 to 11 years). Collected exposures can be organized into three main parts of the exposome: outdoor exposures, chemical exposures measured through biomarkers, and lifestyle factors. All variables incorporated in the dataset have been appropriately pre-processed previous to analysis (outliers removal, normalization, and missing values imputation). Regarding phenotype data, a wide range of health outcomes are also available for the childhood period including phenotypes related to (1) obesity and cardiometabolic health, (2) respiratory health, and (3) cognition and mental health. Finally, omic data (mainly transcriptomic and genome-wide DNA methylation data) are also available in the childhood period. The objective of this dataset was illustrating a real case study of:
Omic data from the HELIX project were uploaded into to the Opal BRGE site simulating a single-site DataSHIELD architecture, as illustrated in Figure 3.4. Data were stored under the form of resources (one per cohort and data type). As illustrated in the figure, transcriptomic data) were stored under the form of expressionSets while EWAS data were stored as GenomicRatioSet. A summary of the structure and organization of this data in the Opal for the HELIX project is illustrated in Figure 3.4.
Figure 3.4: Input HELIX datasets organized by cohort in the Opal server.
In the next subsection, we present extra details for each of these datasets.
The gene expression data available for the HELIX cohorts correspond to microarray data deriving from the Human Transcriptome Array 2.0 platform of Affymetrix. These are hosted in the the Opal BRGE site as ExpressionSet objects. Particularly, one ExpressionSet file per cohort is hosted in the Opal BRGE site in the form of a resource, as presented in the Figure @ref(fig:FigHELIXresourcesOpal_transcrip). As it can be seen, data are organized within the Opal server in a project called HELIX_omics.
(#fig:FigHELIXresourcesOpal_transcrip)Input ExposomeSet data organized by cohort in the Opal server as resources.
The epigenetic data available for the HELIX cohorts correspond to genome-wide DNA methylation data obtained by microarray experiments using the Infinium HumanMethylation450k platform from Illumina. The data format in which these data are hosted in the Opal is the type GenomicRatioSet. GenomicRatioSet is a extend of the class SummarizedExperiment especially designed for DNA methylation data obtained with the Infinium HumanMethylation450k platform, that will be analysed with the associated R package minfi. This type of data usually contain pre-processed DNA methylation values at a genome-wide level, in the form of M or/and Beta values, together with the associated genomic coordinates. As in the case of ExpressionSet types, GenomicRatioSet can also incorporate phenotype and metadata information.
Particularly, one GenomicRatioSet file per cohort is hosted in the Opal BRGE site in the form of a resource, as presented in the Figure @ref(fig:FigHELIXresourcesOpal_DNAm). As it can be seen, data are organized within the Opal server in a project called HELIX_omics.
(#fig:FigHELIXresourcesOpal_DNAm)Input GenomicRatioSet data organized by cohort in the Opal server as resources.
|
⚠️ RESOURCES USED ALONG THIS SECTION |
|||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
From https://opal-demo.obiba.org/ : |
|||||||||||||||||||||
|
Along this section we will illustrate how to get some information about the loaded genotype data and how to perform some basic exploration analysis. Three different aspects will be illustrated:
Information on the data used can be found on the Section (#CINECAopal). We will be working on a virtual scenario with three different cohorts to visualize a real scenario. The same functions can be used in the case of a single cohort.
The study configuration for this section is the following:
Figure 4.1: Proposed infrastructure.
We have to create an Opal connection object to the different cohorts server. We do that using the following functions.
require('DSI')
require('DSOpal')
require('dsBaseClient')
require('dsOmicsClient')
builder <- DSI::newDSLoginBuilder()
builder$append(server = "cohort1", url = "https://opal-demo.obiba.org/",
user = "dsuser", password = "P@ssw0rd",
driver = "OpalDriver", profile = "omics")
builder$append(server = "cohort2", url = "https://opal-demo.obiba.org/",
user = "dsuser", password = "P@ssw0rd",
driver = "OpalDriver", profile = "omics")
builder$append(server = "cohort3", url = "https://opal-demo.obiba.org/",
user = "dsuser", password = "P@ssw0rd",
driver = "OpalDriver", profile = "omics")
logindata <- builder$build()
conns <- DSI::datashield.login(logins = logindata)
It is important to note that in this use case, we are only using one server that contains all the resources (https://opal-demo.obiba.org/), on this server there are all the resources that correspond to the different cohorts. On a more real scenario each one of the builder$append instructions would be connecting to a different server.
Now that we have created a connection object to the different Opals, we have started three R session, our analysis will take place on those remote sessions, so we have to load the data into them.
In this use case we will use 63 different resources from the GWAS project hosted on the demo Opal server. This resources correspond to VCF files with information on individual chromosomes (21 chromosomes per three cohorts). The names of the resources are chrXXY (where XX is the chromosome number and Y is the cohort A/B/C). Following the Opal syntax, we will refer to them using the string GWAS.chrXXY.
We have to refer specifically to each different server by using conns[X], this allows us to communicate with the server of interest to indicate to it the resources that it has to load.
# Cohort 1 resources
lapply(1:21, function(x){
DSI::datashield.assign.resource(conns[1], paste0("chr", x), paste0("GWAS.chr", x,"A"))
})
# Cohort 2 resources
lapply(1:21, function(x){
DSI::datashield.assign.resource(conns[2], paste0("chr", x), paste0("GWAS.chr", x,"B"))
})
# Cohort 3 resources
lapply(1:21, function(x){
DSI::datashield.assign.resource(conns[3], paste0("chr", x), paste0("GWAS.chr", x,"C"))
})
Now we have assigned all the resources named GWAS.chrXXY into our remote R session. We have assigned them to the variables called chrXX. To verify this step has been performed correctly, we could use the ds.class function to check for their class and that they exist on the remote sessions.
ds.class("chr1")
$cohort1
[1] "GDSFileResourceClient" "FileResourceClient" "ResourceClient"
[4] "R6"
$cohort2
[1] "GDSFileResourceClient" "FileResourceClient" "ResourceClient"
[4] "R6"
$cohort3
[1] "GDSFileResourceClient" "FileResourceClient" "ResourceClient"
[4] "R6"
We can see that the object chr1 exists in all the three servers.
Finally the resources are resolved to retrieve the data in the remote sessions.
lapply(1:21, function(x){
DSI::datashield.assign.expr(conns = conns, symbol = paste0("gds", x, "_object"),
expr = as.symbol(paste0("as.resource.object(chr", x, ")")))
})
Now we have resolved the resources named chrXX into our remote R sessions. The objects retrieved have been assigned into variables named gdsXX_object. We can check the process was successful as we did before.
ds.class("gds1_object")
$cohort1
[1] "GdsGenotypeReader"
attr(,"package")
[1] "GWASTools"
$cohort2
[1] "GdsGenotypeReader"
attr(,"package")
[1] "GWASTools"
$cohort3
[1] "GdsGenotypeReader"
attr(,"package")
[1] "GWASTools"
The objects we have loaded into our remote sessions are VCF files that contain genomic information of the individuals. To perform a GWAS this information has to be related to some phenotypes to extract relationships. Therefore, we need to load the phenotypes into the remote sessions. The phenotypes information is a table that contains the individuals as rows and phenotypes as columns. In this use case, we will use a resource (as with the VCF files) to load the phenotypes table into the remote sessions.
As with the VCF resources, here we load a different resource to each cohort, as it is expected that each cohort has only the phenotypes information regarding their individuals, therefore different tables. Note that we are assigning the resource to the same variable at each server (pheno), so we only need one function call to resolve all the resources together.
# Cohort 1 phenotypes table
DSI::datashield.assign.resource(conns[1], "pheno", "GWAS.ega_phenotypes_1")
# Cohort 2 phenotypes table
DSI::datashield.assign.resource(conns[2], "pheno", "GWAS.ega_phenotypes_2")
# Cohort 3 phenotypes table
DSI::datashield.assign.resource(conns[3], "pheno", "GWAS.ega_phenotypes_3")
# Resolve phenotypes table
DSI::datashield.assign.expr(conns = conns, symbol = "pheno_object",
expr = quote(as.resource.data.frame(pheno)))
We can follow the same analogy as before to know that we have assigned the phenotypes table to a variable called pheno_object on the remote R session.
ds.class("pheno_object")
$cohort1
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
$cohort2
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
$cohort3
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
We can also check the column names to see which information is present on the table.
ds.colnames("pheno_object")[[1]]
[1] "age_cancer_diagnosis" "age_recruitment"
[3] "alcohol_day" "basophill_count"
[5] "behaviour_tumour" "bmi"
[7] "c_reactive_protein_reportability" "record_origin"
[9] "report_format" "cholesterol"
[11] "birth_country" "actual_tobacco_smoker"
[13] "date_diagnosis" "diabetes_diagnosed_doctor"
[15] "diastolic_blood_pressure" "duration_moderate_activity"
[17] "duration_vigorous_activity" "energy"
[19] "eosinophill_count" "ethnicity"
[21] "ever_smoked" "fasting_time"
[23] "alcohol_frequency" "hdl_cholesterol"
[25] "hip_circumference" "histology_tumour"
[27] "home_e_coordinate" "home_n_coordinate"
[29] "e_house_score" "s_house_score"
[31] "w_house_score" "e_income_score"
[33] "w_income_score" "ldl_direct"
[35] "lymphocyte_count" "monocyte_count"
[37] "neutrophill_count" "n_accidental_death_close_gen_family"
[39] "household_count" "children_count"
[41] "operations_count" "operative_procedures"
[43] "past_tobacco" "platelet_count"
[45] "pulse_rate" "qualification"
[47] "cancer_occurrences" "reticulocuyte_count"
[49] "sleep_duration" "height"
[51] "subject_id" "triglycerides"
[53] "cancer_type_idc10" "cancer_type_idc9"
[55] "weight" "leukocyte_count"
[57] "sex" "phenotype"
[59] "age_death" "age_diagnosis_diabetes"
[61] "date_k85_report" "date_k86_report"
[63] "date_death" "enm_diseases"
[65] "source_k85_report" "source_k86_report"
[67] "age_hbp_diagnosis" "mental_disorders"
[69] "respiratory_disorders" "cancer_year"
[71] "circulatory_disorders" "digestive_disorders"
[73] "nervous_disorders" "immune_disorders"
Arrived at this point, we have 21 VCF objects at each cohort R session (each one corresponds to a chromosome) and a phenotypes table. The next step is merging each of the VCF objects with the phenotypes table. The same procedure as the single cohort can be applied to extract the arguments required.
With all this information we can now merge the phenotypes and VCF objects into a type of object named GenotypeData. We will use the ds.GenotypeData function.
lapply(1:21, function(x){
ds.GenotypeData(x=paste0('gds', x,'_object'), covars = 'pheno_object',
columnId = "subject_id", sexId = "sex",
male_encoding = "male", female_encoding = "female",
case_control_column = "diabetes_diagnosed_doctor",
case = "Yes", control = "No",
newobj.name = paste0('gds.Data', x), datasources = conns)
})
The objects we just created are named gds.DataXX on the remote session. Now we are ready to perform the exploration analysis on the data we just loaded on the remote sessions.
Once we have everything loaded in the remote R sessions, we can begin a simple exploratory analysis by checking basic information. We can check the following:
To get the number of variants and individuals:
dimensions <- do.call(rbind, lapply(1:21, function(x){
data.frame(ds.genoDimensions(paste0('gds.Data', x)))
}))
data.frame(dimensions)
cohort1.snp_number cohort1.scan_number cohort1.chromosomes
1 69806 817 1
2 74985 817 2
3 53164 817 3
4 56259 817 4
5 51985 817 5
6 54835 817 6
7 53037 817 7
8 50372 817 8
9 40920 817 9
10 47075 817 10
11 44979 817 11
12 43477 817 12
13 33411 817 13
14 29742 817 14
15 26905 817 15
16 29183 817 16
17 24546 817 17
18 26353 817 18
19 20408 817 19
20 20771 817 20
21 13079 817 21
cohort2.snp_number cohort2.scan_number cohort2.chromosomes
1 69806 1073 1
2 74985 1073 2
3 53164 1073 3
4 56259 1073 4
5 51985 1073 5
6 54835 1073 6
7 53037 1073 7
8 50372 1073 8
9 40920 1073 9
10 47075 1073 10
11 44979 1073 11
12 43477 1073 12
13 33411 1073 13
14 29742 1073 14
15 26905 1073 15
16 29183 1073 16
17 24546 1073 17
18 26353 1073 18
19 20408 1073 19
20 20771 1073 20
21 13079 1073 21
cohort3.snp_number cohort3.scan_number cohort3.chromosomes
1 69806 557 1
2 74985 557 2
3 53164 557 3
4 56259 557 4
5 51985 557 5
6 54835 557 6
7 53037 557 7
8 50372 557 8
9 40920 557 9
10 47075 557 10
11 44979 557 11
12 43477 557 12
13 33411 557 13
14 29742 557 14
15 26905 557 15
16 29183 557 16
17 24546 557 17
18 26353 557 18
19 20408 557 19
20 20771 557 20
21 13079 557 21
To get the available phenotypes:
ds.varLabels('gds.Data1')
$cohort1
[1] "age_cancer_diagnosis" "age_recruitment"
[3] "alcohol_day" "basophill_count"
[5] "behaviour_tumour" "bmi"
[7] "c_reactive_protein_reportability" "record_origin"
[9] "report_format" "cholesterol"
[11] "birth_country" "actual_tobacco_smoker"
[13] "date_diagnosis" "diabetes_diagnosed_doctor"
[15] "diastolic_blood_pressure" "duration_moderate_activity"
[17] "duration_vigorous_activity" "energy"
[19] "eosinophill_count" "ethnicity"
[21] "ever_smoked" "fasting_time"
[23] "alcohol_frequency" "hdl_cholesterol"
[25] "hip_circumference" "histology_tumour"
[27] "home_e_coordinate" "home_n_coordinate"
[29] "e_house_score" "s_house_score"
[31] "w_house_score" "e_income_score"
[33] "w_income_score" "ldl_direct"
[35] "lymphocyte_count" "monocyte_count"
[37] "neutrophill_count" "n_accidental_death_close_gen_family"
[39] "household_count" "children_count"
[41] "operations_count" "operative_procedures"
[43] "past_tobacco" "platelet_count"
[45] "pulse_rate" "qualification"
[47] "cancer_occurrences" "reticulocuyte_count"
[49] "sleep_duration" "height"
[51] "scanID" "triglycerides"
[53] "cancer_type_idc10" "cancer_type_idc9"
[55] "weight" "leukocyte_count"
[57] "sex" "phenotype"
[59] "age_death" "age_diagnosis_diabetes"
[61] "date_k85_report" "date_k86_report"
[63] "date_death" "enm_diseases"
[65] "source_k85_report" "source_k86_report"
[67] "age_hbp_diagnosis" "mental_disorders"
[69] "respiratory_disorders" "cancer_year"
[71] "circulatory_disorders" "digestive_disorders"
[73] "nervous_disorders" "immune_disorders"
$cohort2
[1] "age_cancer_diagnosis" "age_recruitment"
[3] "alcohol_day" "basophill_count"
[5] "behaviour_tumour" "bmi"
[7] "c_reactive_protein_reportability" "record_origin"
[9] "report_format" "cholesterol"
[11] "birth_country" "actual_tobacco_smoker"
[13] "date_diagnosis" "diabetes_diagnosed_doctor"
[15] "diastolic_blood_pressure" "duration_moderate_activity"
[17] "duration_vigorous_activity" "energy"
[19] "eosinophill_count" "ethnicity"
[21] "ever_smoked" "fasting_time"
[23] "alcohol_frequency" "hdl_cholesterol"
[25] "hip_circumference" "histology_tumour"
[27] "home_e_coordinate" "home_n_coordinate"
[29] "e_house_score" "s_house_score"
[31] "w_house_score" "e_income_score"
[33] "w_income_score" "ldl_direct"
[35] "lymphocyte_count" "monocyte_count"
[37] "neutrophill_count" "n_accidental_death_close_gen_family"
[39] "household_count" "children_count"
[41] "operations_count" "operative_procedures"
[43] "past_tobacco" "platelet_count"
[45] "pulse_rate" "qualification"
[47] "cancer_occurrences" "reticulocuyte_count"
[49] "sleep_duration" "height"
[51] "scanID" "triglycerides"
[53] "cancer_type_idc10" "cancer_type_idc9"
[55] "weight" "leukocyte_count"
[57] "sex" "phenotype"
[59] "age_death" "age_diagnosis_diabetes"
[61] "date_k85_report" "date_k86_report"
[63] "date_death" "enm_diseases"
[65] "source_k85_report" "source_k86_report"
[67] "age_hbp_diagnosis" "mental_disorders"
[69] "respiratory_disorders" "cancer_year"
[71] "circulatory_disorders" "digestive_disorders"
[73] "nervous_disorders" "immune_disorders"
$cohort3
[1] "age_cancer_diagnosis" "age_recruitment"
[3] "alcohol_day" "basophill_count"
[5] "behaviour_tumour" "bmi"
[7] "c_reactive_protein_reportability" "record_origin"
[9] "report_format" "cholesterol"
[11] "birth_country" "actual_tobacco_smoker"
[13] "date_diagnosis" "diabetes_diagnosed_doctor"
[15] "diastolic_blood_pressure" "duration_moderate_activity"
[17] "duration_vigorous_activity" "energy"
[19] "eosinophill_count" "ethnicity"
[21] "ever_smoked" "fasting_time"
[23] "alcohol_frequency" "hdl_cholesterol"
[25] "hip_circumference" "histology_tumour"
[27] "home_e_coordinate" "home_n_coordinate"
[29] "e_house_score" "s_house_score"
[31] "w_house_score" "e_income_score"
[33] "w_income_score" "ldl_direct"
[35] "lymphocyte_count" "monocyte_count"
[37] "neutrophill_count" "n_accidental_death_close_gen_family"
[39] "household_count" "children_count"
[41] "operations_count" "operative_procedures"
[43] "past_tobacco" "platelet_count"
[45] "pulse_rate" "qualification"
[47] "cancer_occurrences" "reticulocuyte_count"
[49] "sleep_duration" "height"
[51] "scanID" "triglycerides"
[53] "cancer_type_idc10" "cancer_type_idc9"
[55] "weight" "leukocyte_count"
[57] "sex" "phenotype"
[59] "age_death" "age_diagnosis_diabetes"
[61] "date_k85_report" "date_k86_report"
[63] "date_death" "enm_diseases"
[65] "source_k85_report" "source_k86_report"
[67] "age_hbp_diagnosis" "mental_disorders"
[69] "respiratory_disorders" "cancer_year"
[71] "circulatory_disorders" "digestive_disorders"
[73] "nervous_disorders" "immune_disorders"
attr(,"class")
[1] "dsvarLabels" "list"
To get the SNP rs IDs:
head(ds.getSNPs('gds.Data1'))
[1] "rs575272151" "rs75454623" "rs2691315" "rs2691277" "rs114420996"
[6] "rs62637817"
We have also developed functions to compute statistics of the genotype data, the statistics implemented on dsOmicsClient are:
To get the allele frequencies we have two different methods, to obtain the pooled frequencies or to obtain them by study server:
# Pooled allele frequencies
ds.alleleFrequency(paste0('gds.Data', 1:21))
# A tibble: 762,900 x 3
rs n pooled_MAF
<chr> <dbl> <dbl>
1 rs575272151 2435 0.0863
2 rs75454623 2428 0.482
3 rs2691315 2429 0.409
4 rs75478250 2432 0.355
5 rs940550 2429 0.297
6 rs144169752 2431 0.120
7 rs368130516 2429 0.0593
8 rs202206030 2433 0.0724
9 rs2853819 2428 0.202
10 rs202025951 2429 0.184
# ... with 762,890 more rows
# By-study allele frquencies
ds.alleleFrequency(paste0('gds.Data', 1:21), type = "split")
$cohort1
# A tibble: 853,933 x 3
rs n MAF
* <chr> <dbl> <dbl>
1 rs575272151 813 0.0795
2 rs75454623 812 0.476
3 rs2691315 812 0.432
4 rs2691277 813 0.0861
5 rs114420996 812 0.131
6 rs62637817 812 0.0686
7 rs75478250 815 0.348
8 rs940550 812 0.314
9 rs377100675 814 0.259
10 rs62642131 812 0.137
# ... with 853,923 more rows
$cohort2
# A tibble: 855,153 x 3
rs n MAF
* <chr> <dbl> <dbl>
1 rs575272151 1069 0.0965
2 rs75454623 1064 0.491
3 rs2691315 1065 0.433
4 rs2691277 1066 0.0982
5 rs114420996 1064 0.120
6 rs62637817 1064 0.0543
7 rs75478250 1064 0.364
8 rs940550 1065 0.336
9 rs377100675 1066 0.273
10 rs62642131 1063 0.129
# ... with 855,143 more rows
$cohort3
# A tibble: 778,123 x 3
rs n MAF
* <chr> <dbl> <dbl>
1 rs575272151 553 0.0753
2 rs75454623 552 0.477
3 rs2691315 552 0.331
4 rs75478250 553 0.349
5 rs940550 552 0.199
6 rs144169752 553 0.132
7 rs368130516 553 0.0628
8 rs202206030 553 0.0558
9 rs2853819 553 0.236
10 rs202025951 554 0.117
# ... with 778,113 more rows
attr(,"class")
[1] "dsalleleFrequency" "list"
To do the Hardy-Weinberg Equilibrium testing is not implemented to get the pooled results as the allele frequency, we can only get the results from each server independently.
do.call(rbind, lapply(1:21, function(x){
as_tibble(ds.exactHWE(paste0('gds.Data', x)))
}))
# A tibble: 865,292 x 3
cohort1$rs $chr $pval cohort2$rs $chr $pval cohort3$rs $chr
<chr> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
1 rs575272151 1 8.10e- 1 rs575272151 1 2.89e- 1 rs575272151 1
2 rs75454623 1 5.90e-115 rs75454623 1 1.22e-160 rs75454623 1
3 rs2691315 1 8.30e- 1 rs2691315 1 3.82e- 1 rs2691315 1
4 rs2691277 1 4.02e- 2 rs2691277 1 2.67e- 3 rs2691277 1
5 rs114420996 1 8.78e- 1 rs114420996 1 1.00e+ 0 rs114420996 1
6 rs62637817 1 2.85e- 5 rs62637817 1 2.94e- 2 rs62637817 1
7 rs75478250 1 4.45e- 2 rs75478250 1 4.27e- 1 rs75478250 1
8 rs940550 1 4.36e- 4 rs940550 1 2.32e- 2 rs940550 1
9 rs377100675 1 5.79e- 43 rs377100675 1 3.34e- 57 rs377100675 1
10 rs62642131 1 4.16e- 5 rs62642131 1 2.01e- 9 rs62642131 1
# ... with 865,282 more rows
We have implemented a pooled component analysis method into OmicSHIELD. This function will create new phenotypes on the GenotypeData objects with the principal components results, so we can later adjust our analysis.
pca_res <- ds.PCA(paste0('gds.Data', 1:21), standardize = F)
The principal component results can also be visualized using the available grouping variables (categorical) present on the GenotypeData objects used on the ds.PCA function.
dsOmicsClient::plotPCA(pca_res, group = "sex")
Or we can visualize the results with no groupings.
dsOmicsClient::plotPCA(pca_res)
datashield.logout(conns)
Three different use cases will be illustrated in this section:
For this, we will be using data from the CINECA project presented in the section 3.1. For all the illustrated cases we will use data divided by chromosome (VCF files). The procedure is the same if all the variants are on a single VCF. The phenotype information used throughout this section is not contained inside the VCF files mentioned previously, it is contained using traditional CSV/Excel files. This is to replicate the typical scenario where an investigator receives the genomic and phenotype data separated and has to merge it to study a specific relationship between the gene expression / variants and a certain phenotype.
|
⚠️ RESOURCES USED ALONG THIS SECTION |
|||||||||
|---|---|---|---|---|---|---|---|---|---|
|
From https://opal-demo.obiba.org/ : |
|||||||||
|
The single cohort analysis is a way of performing a GWAS study guaranteeing GDPR data confidentiality. The structure followed is illustrated on the following figure.
Figure 5.1: Proposed infrastructure to perform single-cohort GWAS studies.
The data analyst corresponds to the “RStudio” session, which through DataSHIELD Interface (DSI) connects with the Opal server located at the cohort network. This Opal server contains an array of resources that correspond to the different VCF files (sliced by chromosome)1 and a resource that corresponds to the phenotypes table of the studied individuals.
We have to create an Opal connection object to the cohort server. We do that using the following functions.
require('DSI')
require('DSOpal')
require('dsBaseClient')
require('dsOmicsClient')
builder <- DSI::newDSLoginBuilder()
builder$append(server = "cohort1", url = "https://opal-demo.obiba.org/",
user = "dsuser", password = "P@ssw0rd",
driver = "OpalDriver", profile = "omics")
logindata <- builder$build()
conns <- DSI::datashield.login(logins = logindata)
Now that we have created a connection object to the Opal, we have started a new R session on the server, our analysis will take place in this remote session, so we have to load the data into it.
In this use case we will use 21 different resources from the GWAS project hosted on the demo Opal server. This resources correspond to VCF files with information on individual chromosomes. The names of the resources are chrXX (where XX is the chromosome number). Following the Opal syntax, we will refer to them using the string GWAS.chrXX.
To load the resources we will use the DSI::datashield.assign.resource() function. Note that along the use case we use the lapply function to simplify our code since we have to perform repetitive operations. lapply is a way of creating loops in R, for more information visit the documentation.
lapply(1:21, function(x){
DSI::datashield.assign.resource(conns, paste0("chr", x), paste0("GWAS.chr", x))
})
Now we have assigned all the resources named GWAS.chrXX into our remote R session. We have assigned them to the variables called chrXX. To verify this step has been performed correctly, we could use the ds.class function to check for their class and that they exist on the remote session.
ds.class("chr1")
$cohort1
[1] "GDSFileResourceClient" "FileResourceClient" "ResourceClient"
[4] "R6"
To resolve the resources and retrieve the data in the remote session we will use the DSI::datashield.assign.expr() function. This function runs a line of code on the remote session,2 and in particular we want to run the function as.resource.object(), which is the DataSHIELD function in charge of resolving the resources.
lapply(1:21, function(x){
DSI::datashield.assign.expr(conns = conns, symbol = paste0("gds", x, "_object"),
expr = as.symbol(paste0("as.resource.object(chr", x, ")")))
})
Now we have resolved the resources named chrXX into our remote R session. The objects retrieved have been assigned into variables named gdsXX_object. We can check the process was successful as we did before.
ds.class("gds1_object")
$cohort1
[1] "GdsGenotypeReader"
attr(,"package")
[1] "GWASTools"
The objects we have loaded into our remote session are VCF files that contain genomic information of the individuals. To perform a GWAS this information has to be related to some phenotypes to extract relationships. Therefore, we need to load the phenotypes into the remote session. The phenotypes information is a table that contains the individuals as rows and phenotypes as columns. In this use case, we will use a resource (as with the VCF files) to load the phenotypes table into the remote session.
The procedure is practically the same as before with some minor tweaks. To begin, we have the phenotypes information in a single table (hence, a single resource). Then, instead of using the function as.resource.object, we will use as.resource.data.frame, this is because before we were loading a special object (VCF file) and now we are loading a plain table, so the internal treatment on the remote session has to be different.
DSI::datashield.assign.resource(conns, "pheno", "GWAS.ega_phenotypes")
DSI::datashield.assign.expr(conns = conns, symbol = "pheno_object",
expr = quote(as.resource.data.frame(pheno)))
We can follow the same analogy as before to know that we have assigned the phenotypes table to a variable called pheno_object on the remote R session.
ds.class("pheno_object")
$cohort1
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
We can also check the column names to see which information is present on the table.
ds.colnames("pheno_object")[[1]]
[1] "age_cancer_diagnosis" "age_recruitment"
[3] "alcohol_day" "basophill_count"
[5] "behaviour_tumour" "bmi"
[7] "c_reactive_protein_reportability" "record_origin"
[9] "report_format" "cholesterol"
[11] "birth_country" "actual_tobacco_smoker"
[13] "date_diagnosis" "diabetes_diagnosed_doctor"
[15] "diastolic_blood_pressure" "duration_moderate_activity"
[17] "duration_vigorous_activity" "energy"
[19] "eosinophill_count" "ethnicity"
[21] "ever_smoked" "fasting_time"
[23] "alcohol_frequency" "hdl_cholesterol"
[25] "hip_circumference" "histology_tumour"
[27] "home_e_coordinate" "home_n_coordinate"
[29] "e_house_score" "s_house_score"
[31] "w_house_score" "e_income_score"
[33] "w_income_score" "ldl_direct"
[35] "lymphocyte_count" "monocyte_count"
[37] "neutrophill_count" "n_accidental_death_close_gen_family"
[39] "household_count" "children_count"
[41] "operations_count" "operative_procedures"
[43] "past_tobacco" "platelet_count"
[45] "pulse_rate" "qualification"
[47] "cancer_occurrences" "reticulocuyte_count"
[49] "sleep_duration" "height"
[51] "subject_id" "triglycerides"
[53] "cancer_type_idc10" "cancer_type_idc9"
[55] "weight" "leukocyte_count"
[57] "sex" "phenotype"
[59] "age_death" "age_diagnosis_diabetes"
[61] "date_k85_report" "date_k86_report"
[63] "date_death" "enm_diseases"
[65] "source_k85_report" "source_k86_report"
[67] "age_hbp_diagnosis" "mental_disorders"
[69] "respiratory_disorders" "cancer_year"
[71] "circulatory_disorders" "digestive_disorders"
[73] "nervous_disorders" "immune_disorders"
Arrived at this point, we have 21 VCF objects (each one corresponds to a chromosome) and a phenotypes table on our remote session. The next step is merging each of the VCF objects with the phenotypes table. Before doing that however, we have to gather some information from the phenotypes table. The information to gather is summarized on the Table 5.1.
| Information | Details |
|---|---|
| Which column has the samples identifier? | Column name that contains the IDs of the samples. Those are the IDs that will be matched to the ones present on the VCF objects. |
| Which is the sex column on the covariates file? | Column name that contains the sex information. |
| How are males and how are females encoded into this column? | There is not a standard way to encode the sex information, some people use 0/1; male/female; etc. Our approach uses a library that requires a specific encoding, for that reason we need to know the original encoding to perform the translation. |
| Which is the case-control column? | Column name that contains the case/control to be studied. |
| What is the encoding for case and for control on this column? | Similar as the sex column, the case/control will be translated to the particular standard of the software we have used to develop our functionalities3 . |
If we are not sure about the exact column names, we can use the function ds.colnames as we did before. Also, we can use the function ds.table1D to check the level names of the categorical variables.
ds.table1D("pheno_object$sex")$counts
pheno_object$sex
female 1271
male 1233
Total 2504
ds.table1D("pheno_object$diabetes_diagnosed_doctor")$counts
pheno_object$diabetes_diagnosed_doctor
Do_not_know 612
No 632
Prefer_not_to_answer 590
Yes 668
Total 2502
From the different functions we used, we can extract our answers (Summarized on the Table 5.2)
| Question | Answer |
|---|---|
| Which column has the samples identifier? | “subject_id” |
| Which is the sex column on the covariates file? | “sex” |
| How are males and how are females encoded into this column? | “male”/“female” |
| Which is the case-control column of interest of the covariates? | “diabetes_diagnosed_doctor” |
| What is the encoding for case and for control on this column? | “Yes”/“No” |
With all this information we can now merge the phenotypes and VCF objects into a type of object named GenotypeData. We will use the ds.GenotypeData function.
lapply(1:21, function(x){
ds.GenotypeData(x=paste0('gds', x,'_object'), covars = 'pheno_object',
columnId = "subject_id", sexId = "sex",
male_encoding = "male", female_encoding = "female",
case_control_column = "diabetes_diagnosed_doctor",
case = "Yes", control = "No",
newobj.name = paste0('gds.Data', x), datasources = conns)
})
The objects we just created are named gds.DataXX on the remote session. Now we are ready to perform a GWAS analysis. To perform a GWAS we have to supply this “Genotype Data” objects and some sort of formula in which we will specify the type of association we are interested on studying. The R language has it’s own way of writing formulas, a simple example would be the following (relate the condition to the smoke variable adjusted by sex):
condition ~ smoke + sex
For this use case we will use the following formula.
diabetes_diagnosed_doctor ~ sex + hdl_cholesterol
We are finally ready to achieve our ultimate goal, performing a GWAS analysis. We have our data (genotype + phenotypes) organized into the correspondent objects (GenotypeData) and our association formula ready. The function ds.metaGWAS is in charge of doing the analysis, we have to supply it with the names of the GenotypeData objects of the remote R session and the formula. Since we have 21 different objects, the paste0 function is used to simplify the call.
results_single <- ds.metaGWAS(genoData = paste0("gds.Data", 1:21),
model = diabetes_diagnosed_doctor ~ sex + hdl_cholesterol)[[1]]
results_single
# A tibble: 865,240 x 10
rs chr pos p.value n.obs freq Est Est.SE ref_allele alt_allele
* <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 rs4648~ 1 2.89e6 7.41e-6 1292 0.512 -0.323 0.0721 C T
2 rs4659~ 1 1.20e8 9.05e-6 1296 0.587 0.321 0.0723 T C
3 rs1274~ 1 2.90e6 9.72e-6 1294 0.548 -0.321 0.0725 C T
4 rs3137~ 1 8.62e7 9.93e-6 1293 0.277 -0.376 0.0852 C T
5 rs1132~ 1 1.17e8 1.11e-5 1292 0.859 -0.459 0.104 T C
6 rs1736~ 1 2.82e7 2.26e-5 1293 0.864 -0.466 0.110 C T
7 rs2920~ 1 2.30e7 2.67e-5 1293 0.343 0.327 0.0779 G A
8 rs7411~ 1 1.47e8 3.07e-5 1294 0.932 0.609 0.146 A G
9 rs5800~ 1 5.30e7 3.69e-5 1293 0.929 0.590 0.143 C A
10 rs7290~ 1 5.30e7 4.13e-5 1294 0.929 0.586 0.143 G A
# ... with 865,230 more rows
We can display the results of the GWAS using a Manhattan plot.
manhattan(results_single)
(#fig:singleCohortManhattan_plot)Manhattan plot of the single-cohort results.
Moreover, we can visualize the region with the lowest p-value in detail using the function LocusZoom. There are multiple arguments to control the gene annotation, display a different region of the GWAS, display a different range of positions and more. Make sure to check ?LocusZoom for all the details.
LocusZoom(results_single, use_biomaRt = FALSE)
Figure 5.2: LocusZoom plot of the surrounding region of the top hit SNP.
datashield.logout(conns)
|
⚠️ RESOURCES USED ALONG THIS SECTION |
|||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
From https://opal-demo.obiba.org/ : |
|||||||||||||||||||||
|
Now all the individuals used for the single cohort use case will be divided into three different synthetic cohorts. The results from each cohort will then be meta-analyzed to be compared to the results obtained by studying all the individuals together. As with the single-cohort methodology illustrated on the prior section, this method guarantees GDPR data confidentiality. The structure for a three cohort study is illustrated on the following figure (note this can be extrapolated for cohorts with a bigger (or lower) partner count).
Figure 5.3: Proposed infrastructure to perform multi-cohort GWAS studies.
The data analyst corresponds to the “RStudio” session, which through DataSHIELD Interface (DSI) connects with the Opal server located at the cohort network. This Opal server contains an array of resources that correspond to the different VCF files (sliced by chromosome)4 and a resource that corresponds to the phenotypes table of the studied individuals.
This use case will portray a complete study methodology, more precisely we will follow this flowchart:
Figure 5.4: Flowchart to perform multi-cohort GWAS studies. Adapted from Winkler et al. (2014)
Brief description of each step:
We have to create an Opal connection object to the different cohorts server. We do that using the following functions.
require('DSI')
require('DSOpal')
require('dsBaseClient')
require('dsOmicsClient')
builder <- DSI::newDSLoginBuilder()
builder$append(server = "cohort1", url = "https://opal-demo.obiba.org/",
user = "dsuser", password = "P@ssw0rd",
driver = "OpalDriver", profile = "omics")
builder$append(server = "cohort2", url = "https://opal-demo.obiba.org/",
user = "dsuser", password = "P@ssw0rd",
driver = "OpalDriver", profile = "omics")
builder$append(server = "cohort3", url = "https://opal-demo.obiba.org/",
user = "dsuser", password = "P@ssw0rd",
driver = "OpalDriver", profile = "omics")
logindata <- builder$build()
conns <- DSI::datashield.login(logins = logindata)
It is important to note that in this use case, we are only using one server that contains all the resources (https://opal-demo.obiba.org/), on this server there are all the resources that correspond to the different cohorts. On a more real scenario each one of the builder$append instructions would be connecting to a different server.
Now that we have created a connection object to the different Opals, we have started three R session, our analysis will take place on those remote sessions, so we have to load the data into them.
In this use case we will use 63 different resources from the GWAS project hosted on the demo Opal server. This resources correspond to VCF files with information on individual chromosomes (21 chromosomes per three cohorts). The names of the resources are chrXXY (where XX is the chromosome number and Y is the cohort A/B/C). Following the Opal syntax, we will refer to them using the string GWAS.chrXXY.
We have to refer specifically to each different server by using conns[X], this allows us to communicate with the server of interest to indicate to it the resources that it has to load.
# Cohort 1 resources
lapply(1:21, function(x){
DSI::datashield.assign.resource(conns[1], paste0("chr", x), paste0("GWAS.chr", x,"A"))
})
# Cohort 2 resources
lapply(1:21, function(x){
DSI::datashield.assign.resource(conns[2], paste0("chr", x), paste0("GWAS.chr", x,"B"))
})
# Cohort 3 resources
lapply(1:21, function(x){
DSI::datashield.assign.resource(conns[3], paste0("chr", x), paste0("GWAS.chr", x,"C"))
})
Now we have assigned all the resources named GWAS.chrXXY into our remote R session. We have assigned them to the variables called chrXX. To verify this step has been performed correctly, we could use the ds.class function to check for their class and that they exist on the remote sessions.
ds.class("chr1")
$cohort1
[1] "GDSFileResourceClient" "FileResourceClient" "ResourceClient"
[4] "R6"
$cohort2
[1] "GDSFileResourceClient" "FileResourceClient" "ResourceClient"
[4] "R6"
$cohort3
[1] "GDSFileResourceClient" "FileResourceClient" "ResourceClient"
[4] "R6"
We can see that the object chr1 exists in all the three servers.
Finally the resources are resolved to retrieve the data in the remote sessions.
lapply(1:21, function(x){
DSI::datashield.assign.expr(conns = conns, symbol = paste0("gds", x, "_object"),
expr = as.symbol(paste0("as.resource.object(chr", x, ")")))
})
Now we have resolved the resources named chrXX into our remote R sessions. The objects retrieved have been assigned into variables named gdsXX_object. We can check the process was successful as we did before.
ds.class("gds1_object")
$cohort1
[1] "GdsGenotypeReader"
attr(,"package")
[1] "GWASTools"
$cohort2
[1] "GdsGenotypeReader"
attr(,"package")
[1] "GWASTools"
$cohort3
[1] "GdsGenotypeReader"
attr(,"package")
[1] "GWASTools"
The objects we have loaded into our remote sessions are VCF files that contain genomic information of the individuals. To perform a GWAS this information has to be related to some phenotypes to extract relationships. Therefore, we need to load the phenotypes into the remote sessions. The phenotypes information is a table that contains the individuals as rows and phenotypes as columns. In this use case, we will use a resource (as with the VCF files) to load the phenotypes table into the remote sessions.
As with the VCF resources, here we load a different resource to each cohort, as it is expected that each cohort has only the phenotypes information regarding their individuals, therefore different tables. Note that we are assigning the resource to the same variable at each server (pheno), so we only need one function call to resolve all the resources together.
# Cohort 1 phenotypes table
DSI::datashield.assign.resource(conns[1], "pheno", "GWAS.ega_phenotypes_1")
# Cohort 2 phenotypes table
DSI::datashield.assign.resource(conns[2], "pheno", "GWAS.ega_phenotypes_2")
# Cohort 3 phenotypes table
DSI::datashield.assign.resource(conns[3], "pheno", "GWAS.ega_phenotypes_3")
# Resolve phenotypes table
DSI::datashield.assign.expr(conns = conns, symbol = "pheno_object",
expr = quote(as.resource.data.frame(pheno)))
We can follow the same analogy as before to know that we have assigned the phenotypes table to a variable called pheno_object on the remote R session.
ds.class("pheno_object")
$cohort1
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
$cohort2
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
$cohort3
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
We can also check the column names to see which information is present on the table.
ds.colnames("pheno_object")[[1]]
[1] "age_cancer_diagnosis" "age_recruitment"
[3] "alcohol_day" "basophill_count"
[5] "behaviour_tumour" "bmi"
[7] "c_reactive_protein_reportability" "record_origin"
[9] "report_format" "cholesterol"
[11] "birth_country" "actual_tobacco_smoker"
[13] "date_diagnosis" "diabetes_diagnosed_doctor"
[15] "diastolic_blood_pressure" "duration_moderate_activity"
[17] "duration_vigorous_activity" "energy"
[19] "eosinophill_count" "ethnicity"
[21] "ever_smoked" "fasting_time"
[23] "alcohol_frequency" "hdl_cholesterol"
[25] "hip_circumference" "histology_tumour"
[27] "home_e_coordinate" "home_n_coordinate"
[29] "e_house_score" "s_house_score"
[31] "w_house_score" "e_income_score"
[33] "w_income_score" "ldl_direct"
[35] "lymphocyte_count" "monocyte_count"
[37] "neutrophill_count" "n_accidental_death_close_gen_family"
[39] "household_count" "children_count"
[41] "operations_count" "operative_procedures"
[43] "past_tobacco" "platelet_count"
[45] "pulse_rate" "qualification"
[47] "cancer_occurrences" "reticulocuyte_count"
[49] "sleep_duration" "height"
[51] "subject_id" "triglycerides"
[53] "cancer_type_idc10" "cancer_type_idc9"
[55] "weight" "leukocyte_count"
[57] "sex" "phenotype"
[59] "age_death" "age_diagnosis_diabetes"
[61] "date_k85_report" "date_k86_report"
[63] "date_death" "enm_diseases"
[65] "source_k85_report" "source_k86_report"
[67] "age_hbp_diagnosis" "mental_disorders"
[69] "respiratory_disorders" "cancer_year"
[71] "circulatory_disorders" "digestive_disorders"
[73] "nervous_disorders" "immune_disorders"
Arrived at this point, we have 21 VCF objects at each cohort R session (each one corresponds to a chromosome) and a phenotypes table. The next step is merging each of the VCF objects with the phenotypes table. The same procedure as the single cohort can be applied to extract the arguments required.
With all this information we can now merge the phenotypes and VCF objects into a type of object named GenotypeData. We will use the ds.GenotypeData function.
lapply(1:21, function(x){
ds.GenotypeData(x=paste0('gds', x,'_object'), covars = 'pheno_object',
columnId = "subject_id", sexId = "sex",
male_encoding = "male", female_encoding = "female",
case_control_column = "diabetes_diagnosed_doctor",
case = "Yes", control = "No",
newobj.name = paste0('gds.Data', x), datasources = conns)
})
The objects we just created are named gds.DataXX on the remote session. Now we are ready to perform a GWAS analysis. We will use the same formula as the single cohort use case to be able to compare the results later:
diabetes_diagnosed_doctor ~ sex + hdl_cholesterol
We now proceed to perform the GWAS analysis.
results <- ds.metaGWAS(genoData = paste0("gds.Data", 1:21),
model = diabetes_diagnosed_doctor ~ sex + hdl_cholesterol)
And we can see the results for each cohort.
# A tibble: 865,162 x 10
rs chr pos p.value n.obs freq Est Est.SE ref_allele alt_allele
* <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 rs5795~ 1 3.90e7 4.11e-7 430 0.737 0.724 0.143 T C
2 rs1132~ 1 1.17e8 5.21e-7 428 0.876 -0.943 0.188 T C
3 rs4970~ 1 3.90e7 5.72e-6 350 0.749 0.740 0.163 A G
4 rs9726~ 1 2.81e7 7.97e-6 430 0.801 0.695 0.156 C T
5 rs2906~ 1 4.43e7 8.97e-6 428 0.551 -0.562 0.127 A G
6 rs1211~ 1 1.87e8 6.94e-6 428 0.407 -0.607 0.135 T A
7 rs1626~ 1 2.06e8 1.07e-5 428 0.215 -0.693 0.157 C A
8 rs1090~ 1 2.81e7 1.04e-5 430 0.8 0.681 0.155 G A
9 rs1938~ 1 1.87e8 1.15e-5 429 0.597 0.590 0.134 A T
10 rs1572~ 1 1.99e7 1.27e-5 428 0.189 -0.739 0.169 T C
# ... with 865,152 more rows
# A tibble: 865,190 x 10
rs chr pos p.value n.obs freq Est Est.SE ref_allele alt_allele
* <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 rs9426~ 1 2.98e7 5.14e-6 549 0.756 -0.665 0.146 C T
2 rs1240~ 1 8.57e7 8.87e-5 549 0.934 0.918 0.234 A G
3 rs2762~ 1 1.45e8 1.10e-4 550 0.52 1.18 0.305 T C
4 rs1116~ 1 8.58e7 1.81e-4 549 0.779 0.522 0.139 T C
5 rs4657~ 1 1.67e8 2.76e-4 552 0.343 -0.414 0.114 G T
6 rs2295~ 1 1.60e8 2.94e-4 549 0.906 0.715 0.197 C A
7 rs1891~ 1 1.16e8 2.92e-4 550 0.725 0.500 0.138 T C
8 rs7768~ 1 2.21e8 3.03e-4 551 0.928 -0.802 0.222 C G
9 rs1209~ 1 3.13e7 3.54e-4 551 0.819 -0.548 0.153 C T
10 rs6670~ 1 1.16e8 4.08e-4 552 0.724 0.487 0.138 T C
# ... with 865,180 more rows
# A tibble: 849,560 x 10
rs chr pos p.value n.obs freq Est Est.SE ref_allele alt_allele
* <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 rs4915~ 1 6.22e7 2.72e-5 285 0.593 -0.700 0.167 G A
2 rs7544~ 1 6.22e7 2.87e-5 285 0.593 -0.700 0.167 T C
3 rs1385~ 1 1.02e8 3.85e-5 286 0.893 1.09 0.264 A G
4 rs1214~ 1 1.72e8 3.76e-5 285 0.909 1.17 0.283 T G
5 rs7521~ 1 1.02e8 4.21e-5 285 0.893 1.08 0.264 G A
6 rs1337~ 1 1.02e8 4.24e-5 285 0.893 1.08 0.264 A G
7 rs6005~ 1 2.45e8 7.08e-5 286 0.893 1.08 0.271 T A
8 rs1938~ 1 1.87e8 8.66e-5 285 0.760 -0.743 0.189 C G
9 rs1213~ 1 1.72e8 8.36e-5 285 0.919 1.16 0.295 G T
10 rs7616~ 1 1.72e8 7.93e-5 285 0.919 1.16 0.295 G C
# ... with 849,550 more rows
We can display the results of the GWAS using a Manhattan plot for each cohort.
manhattan(results[[1]])
Figure 5.5: Manhattan plot of the Cohort 1.
manhattan(results[[2]])
Figure 5.6: Manhattan plot of the Cohort 1.
manhattan(results[[3]])
Figure 5.7: Manhattan plot of the Cohort 1.
Up to this point we have collected the aggregated statistics from each cohort and we have visualized them. Before meta-analyzing this statistics we have to perform some quality control to make sure there are no major issues. There are three different methods bundled in dsOmicsClient to perform such task:
If any of these QC raises some inconsistencies, please contact the correspondent study analyst.
For this plot we are comparing the estimated allele frequencies to a reference set, so the first step is to get ourselves a reference set. For this example we will be using the ALL.wgs.phase1_release_v3.20101123 reference set. We will download it and load it into the client R session.
reference <- readr::read_tsv("https://homepages.uni-regensburg.de/~wit59712/easyqc/exomechip/ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.EUR_AF.txt.exm.tab.sexChr.DI.txt.GZ")
reference
# A tibble: 75,292 x 8
ExmName chrChr_Pos Chr Pos Rs Ref Alt Freq
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
1 exm2268640 chr1:762320 1 762320 rs75333668 C T 0.0013
2 exm47 chr1:865628 1 865628 rs41285790 G A 0.01
3 exm55 chr1:865694 1 865694 rs9988179 C T 0.0026
4 exm67 chr1:871159 1 871159 . G A 0.0013
5 exm106 chr1:874762 1 874762 rs139437968 C T 0.0013
6 exm158 chr1:878744 1 878744 . G C 0.0026
7 exm241 chr1:880502 1 880502 rs74047418 C T 0.0013
8 exm2253575 chr1:881627 1 881627 rs2272757 G A 0.63
9 exm269 chr1:881918 1 881918 rs35471880 G A 0.06
10 exm340 chr1:888659 1 888659 rs3748597 T C 0.95
# ... with 75,282 more rows
Now we just use the function eafPlot and we pass to it the GWAS results and the reference set.
eafPlot(x = results, reference = reference, rs_reference = "Rs", freq_reference = "Freq")
Figure 5.8: EAF Plot.
The obtained results are the expected, which is a linear correlation of slope one. This indicates the alleles are correctly encoded. For further explanation on how to assess the results produced by this plot, please refer to the Figure 4 of Winkler et al. (2014).
This plot will assess the issues with beta estimates, standard errors and P values. Here we will only need the results from the ds.GWAS function and pass them to the function pzPlot.
pzPlot(results)
Figure 5.9: P-Z Plot.
Please note that all the SNPs are plotted, for that reason there may be some performance issues if there are many SNPs to be plotted.
A dataset with no issues will show perfect concordance. For further details please refer to the Figure 3 of Winkler et al. (2014).
Finally, we will assess issues with trait transformations. For this we will only need the results from the ds.GWAS function and pass them to the function seNPlot.
seNPlot(results)
Figure 5.10: SE-N Plot.
If all the study points fall on a line, no issues are present. For further detail please refer to the Figure 2 of Winkler et al. (2014).
Given no problems have been detected on the meta-level QC, we will proceed. Up to this point, we have obtained association results for each cohort on the study. The next step is to combine this information using meta-analysis methods to derive a pooled estimate. Each researcher might have an already built pipeline to do so, or a preferred method; nevertheless, we included a couple methods inside dsOmicsClient. They are the following:
Previously, we have calculated the same GWAS analysis both using all the individuals and a simulated three-cohort scenario. This gives us the ability to assess how good or bad are the different meta-analysis methods; to do so, we will compare both meta-analysis to the complete data, the evaluation method will be counting the number of top hits that match, more specifically we will search for matches on the top twenty hits.
We apply both methods to the multi-cohort use case results:
meta_pvalues <- metaPvalues(results)
meta_bvalues <- metaBetaValues(results)
We can now compare the 20 top hits from the complete data (previously stored on the variable results_single) to the 20 top hits yielded by both meta-analysis methodologies.
# Extract 20 top hits for all three cases
# All individuals
top_20_complete <- results_single %>% arrange(p.value) %>% head(20)
# Pvalue meta-analysis
top_20_pval <- meta_pvalues %>% arrange(p.meta) %>% head(20)
# Beta meta-analysis
top_20_beta <- meta_bvalues %>% arrange(pval) %>% head(20)
# Number of hits shared by all individuals case and beta meta-analysis
sum(top_20_complete$rs %in% top_20_beta$rs)
[1] 13
# Number of hits shared by all individuals case and pvalue meta-analysis
sum(top_20_complete$rs %in% top_20_pval$rs)
[1] 1
The meta-analysis using beta values recovers a higher amount of top SNPs, therefore it can be considerate a more powerful method for GWAS meta-analyses.
We can use the manhattan plot and locus zoom to visualize the results of the meta-analysis as we visualized the single cohorts before. To help assess the meta-analysis behaviour, we will plot together the meta-analysis results and the complete data results.
manhattan(meta_bvalues)
Figure 5.11: Manhattan plot of the meta-analysis results.
manhattan(results_single)
Figure 5.12: Manhattan plot of the complete data.
LocusZoom(meta_bvalues, pvalue = "p.F", use_biomaRt = FALSE)
Figure 5.13: LocusZoom plot of the meta-analysis results.
LocusZoom(results_single, use_biomaRt = FALSE)
Figure 5.14: LocusZoom plot of the complete data.
The dsOmicsClient package also bundles a function to fit pooled generalized linear models (GLM) on single (or multiple) SNPs. On this section we will compare the results of the meta-analysis to the pooled GLM models to see which method is more accurate. The meta-analysis method will be the beta values (fixed-effects model).
The study methodology will be similar to the one we just performed. The top 20 hits yielded by the meta-analysis will be selected, those SNPs will be studied using the pooled GLM function. Both results will be compared to the complete data.
We aim to replicate a real scenario with this study, typically there is no access to the complete data (if so there is no point on performing meta-analysis on synthetic multi-cohorts), therefore we simulate the discovery of SNPs of interest via meta-analysis, then we want to assess if studying individually those SNPs using pooled GLM will put us closer to the results yielded by having the complete data.
We have already calculated the top 20 SNPs discovered by the beta meta-analysis.
top_20_beta
# A tibble: 20 x 10
rs chr pos b.F se.F p.F OR low up pval
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 rs124597~ 19 4.51e7 -0.533 0.0989 7.14e-8 0.587 0.484 0.713 7.14e-8
2 rs956025 12 1.19e8 -0.403 0.0850 2.15e-6 0.668 0.566 0.790 2.15e-6
3 rs8130070 21 4.57e7 0.398 0.0846 2.53e-6 1.49 1.26 1.76 2.53e-6
4 rs8130914 21 4.68e7 -0.881 0.191 4.16e-6 0.414 0.285 0.603 4.16e-6
5 rs177569~ 14 6.92e7 -0.613 0.134 5.06e-6 0.542 0.416 0.705 5.06e-6
6 rs9980256 21 4.24e7 -0.388 0.0852 5.20e-6 0.678 0.574 0.801 5.20e-6
7 rs124233~ 12 1.24e7 0.716 0.158 5.47e-6 2.05 1.50 2.79 5.47e-6
8 rs721891 19 5.23e7 -0.492 0.109 6.04e-6 0.612 0.494 0.757 6.04e-6
9 rs112352~ 10 9.88e7 0.765 0.169 6.24e-6 2.15 1.54 3.00 6.24e-6
10 rs9722322 9 2.79e7 0.347 0.0771 6.59e-6 1.42 1.22 1.65 6.59e-6
11 rs169114~ 10 5.97e7 0.681 0.152 7.54e-6 1.98 1.47 2.66 7.54e-6
12 rs122370~ 9 1.72e7 0.497 0.111 7.99e-6 1.64 1.32 2.05 7.99e-6
13 rs7714862 5 2.39e7 0.376 0.0842 8.02e-6 1.46 1.23 1.72 8.02e-6
14 rs8066656 17 7.65e7 -0.551 0.124 8.98e-6 0.576 0.452 0.735 8.98e-6
15 rs101939~ 2 1.39e8 -0.440 0.0992 9.34e-6 0.644 0.530 0.783 9.34e-6
16 rs101853~ 2 1.39e8 -0.411 0.0927 9.45e-6 0.663 0.553 0.795 9.45e-6
17 rs8015318 14 9.76e7 0.340 0.0771 1.02e-5 1.41 1.21 1.63 1.02e-5
18 rs109323~ 2 2.12e8 -0.393 0.0891 1.03e-5 0.675 0.567 0.804 1.03e-5
19 rs4648446 1 2.89e6 -0.326 0.0742 1.08e-5 0.721 0.624 0.834 1.08e-5
20 rs581815 2 2.25e8 -0.343 0.0779 1.08e-5 0.710 0.609 0.827 1.08e-5
Next, we have to fit pooled GLM using the ds.glmSNP function, obviously we have to fit the same model that we did on the GWAS (Diabetes.diagnosed.by.doctor ~ sex + HDL.cholesterol), otherwise the results will not be consistent.
The ds.glmSNP function is an extension of the dsBaseClient::ds.glm function, therefore it offers pooled analysis capabilities and disclosure controls.
Remember that we have the genotype data separated by chromosome, following that we will construct the GLM calls as follows:
s <- lapply(1:nrow(top_20_beta), function(x){
tryCatch({
ds.glmSNP(snps.fit = top_20_beta[x, 1],
model = diabetes_diagnosed_doctor ~ sex + hdl_cholesterol,
genoData = paste0("gds.Data", top_20_beta[x, 2]))
}, error = function(w){NULL})
})
# Simple data cleaning and column name changes
ss <- data.frame(do.call(rbind, s))
ss <- ss %>% rownames_to_column("rs")
pooled <- ss %>% dplyr::rename(Est.pooled = Estimate,
SE.pooled = Std..Error,
pval.pooled = p.value) %>%
dplyr::select(-c(n, p.adj))
We now extract the results of interest from the meta-analysis and the complete data.
# Get data from all individuals study
original <- results_single %>% dplyr::rename(pval.original = p.value,
Est.original = Est,
SE.original = Est.SE) %>%
dplyr::select(c(rs, Est.original, SE.original, pval.original))
# Arrange dadta from beta meta-analysis
beta <- top_20_beta %>% dplyr::rename(pval.beta = pval,
Est.beta = b.F,
SE.beta = se.F) %>%
dplyr::select(c(rs, Est.beta, SE.beta, pval.beta))
# Merge all data on a single table
total <- Reduce(function(x,y){inner_join(x,y, by = "rs")},
list(original, pooled, beta))
With all the information gathered we can visualize the results on the Table 4.
Table 4: Beta values, standard errors and p-values yielded by the three methods: GWAS of all individuals, meta-analysis of synthetic cohorts, pooled analysis of synthetic cohorts.
| SNP id | Beta | SE | P-Value | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Original | Pooled | Meta | Original | Pooled | Meta | Original | Pooled | Meta | |
| rs4648446 | −0.323 | −0.322 | −0.326 | 0.072 | 0.074 | 0.074 | 7.4 × 10−6 | 1.4 × 10−5 | 1.1 × 10−5 |
| rs10932387 | −0.389 | −0.391 | −0.393 | 0.085 | 0.090 | 0.089 | 5.2 × 10−6 | 1.2 × 10−5 | 1.0 × 10−5 |
| rs10193985 | −0.444 | −0.446 | −0.440 | 0.098 | 0.101 | 0.099 | 5.6 × 10−6 | 1.1 × 10−5 | 9.3 × 10−6 |
| rs10185343 | −0.409 | −0.415 | −0.411 | 0.091 | 0.094 | 0.093 | 7.0 × 10−6 | 1.1 × 10−5 | 9.4 × 10−6 |
| rs581815 | −0.341 | −0.344 | −0.343 | 0.077 | 0.079 | 0.078 | 8.3 × 10−6 | 1.2 × 10−5 | 1.1 × 10−5 |
| rs7714862 | 0.376 | 0.371 | 0.376 | 0.081 | 0.084 | 0.084 | 3.2 × 10−6 | 1.0 × 10−5 | 8.0 × 10−6 |
| rs9722322 | 0.353 | 0.345 | 0.347 | 0.076 | 0.078 | 0.077 | 3.3 × 10−6 | 9.7 × 10−6 | 6.6 × 10−6 |
| rs12237062 | 0.497 | 0.499 | 0.497 | 0.108 | 0.113 | 0.111 | 4.3 × 10−6 | 1.1 × 10−5 | 8.0 × 10−6 |
| rs112352624 | 0.685 | 0.814 | 0.765 | 0.160 | 0.186 | 0.169 | 1.9 × 10−5 | 1.2 × 10−5 | 6.2 × 10−6 |
| rs956025 | −0.406 | −0.405 | −0.403 | 0.083 | 0.086 | 0.085 | 9.4 × 10−7 | 2.3 × 10−6 | 2.2 × 10−6 |
| rs12423318 | 0.722 | 0.729 | 0.716 | 0.154 | 0.166 | 0.158 | 2.8 × 10−6 | 1.1 × 10−5 | 5.5 × 10−6 |
| rs17756914 | −0.622 | −0.639 | −0.613 | 0.133 | 0.141 | 0.134 | 2.7 × 10−6 | 5.8 × 10−6 | 5.1 × 10−6 |
| rs8015318 | 0.327 | 0.341 | 0.340 | 0.076 | 0.078 | 0.077 | 1.6 × 10−5 | 1.2 × 10−5 | 1.0 × 10−5 |
| rs8066656 | −0.549 | −0.565 | −0.551 | 0.123 | 0.131 | 0.124 | 8.2 × 10−6 | 1.7 × 10−5 | 9.0 × 10−6 |
| rs12459712 | −0.512 | −0.528 | −0.533 | 0.096 | 0.101 | 0.099 | 9.0 × 10−8 | 1.6 × 10−7 | 7.1 × 10−8 |
| rs721891 | −0.481 | −0.506 | −0.492 | 0.108 | 0.113 | 0.109 | 8.3 × 10−6 | 7.7 × 10−6 | 6.0 × 10−6 |
| rs8130070 | 0.401 | 0.399 | 0.398 | 0.082 | 0.085 | 0.085 | 1.1 × 10−6 | 2.8 × 10−6 | 2.5 × 10−6 |
| rs9980256 | −0.392 | −0.388 | −0.388 | 0.082 | 0.086 | 0.085 | 2.0 × 10−6 | 5.9 × 10−6 | 5.2 × 10−6 |
| rs8130914 | −0.883 | −1.028 | −0.881 | 0.191 | 0.227 | 0.191 | 3.6 × 10−6 | 5.7 × 10−6 | 4.2 × 10−6 |
Finally, to conclude this study we can compute the bias and mean squared error (code not shown). The results are shown on the Table 5.
Table 5: Mean squared error and bias for the betas from the meta-analysis and pooled analysis compared to the complete data (all individuals).
| MSE | Bias | |
|---|---|---|
| Beta | ||
| Pooled | 2.1 × 10−3 | 4.8 × 10−3 |
| Meta | 3.9 × 10−4 | −3.0 × 10−3 |
With this results we can conclude that refining the statistics for the SNPs of interest using pooled GLM is advisable. Therefore the recommended flowchart when performing a GWAS using dsOmicsClient+DataSHIELD is the following:
REVISAR ESTOS RESULTADOS
dsOmicsClient::dsGWASdsOmicsClient::metaBetaValuesdsOmicsClient::ds.glmSNPAn alternate methodology to performing a meta analysis is to use the fast GWAS methodology. This method is named after the performance it achieves, however, it does provide further advantages other than computational speed. The main advantage is that it does not compute the results for each cohort to then be meta-analyzed, it actually computes them using a pooled technique that achieves the exact same results when all the individuals are together than when they are separated by cohorts. This is of great importance as there is no need to worry about cohort-imbalance, which is usually a problem in multi-cohort meta-analysis studies. The only downside of this method is that it does not yield the same results of a traditional GWAS analysis pipeline, the inacuraccy of the results however is really low, detailed information about that can be found on the paper that proposes the method.
The performance of the function can be summarized to be ~45 seconds per 1M of SNPs (statistic extracted using 2.5k individuals divided into three study servers). From our test this performance has hold linear up to 10M SNPs, we did not test any further. The graphic for < 1M SNPs can be found on the following figure.
Figure 5.15: Benhmark of ds.fastGWAS function.
The usage is similar to the meta-GWAS with a couple of added options, feel free to check the documentation with ?ds.fastGWAS to see information about them.
results_fast <- ds.fastGWAS(genoData = paste0("gds.Data", 1:21),
formula = "diabetes_diagnosed_doctor ~ sex + hdl_cholesterol",
family = "binomial", resample = 6)
datashield.logout(conns)
We can display the results of the fast GWAS using a Manhattan plot.
manhattan(results_fast)
Figure 5.16: Manhattan plot of the Cohort 1.
To finish the GWAS section we can evaluate the results of the beta meta-analysis and the results of the fast GWAS. Both methods will be compared to the ground truth.
As before we can start by seeing the amount of SNPs that both methodologies recover as top hits.
# Extract 20 top hits for all three cases
# All individuals
top_20_complete <- results_single %>% arrange(p.value) %>% head(20)
# Fast GWAS
top_20_fast <- results_fast %>% arrange(p.value) %>% head(20)
# Beta meta-analysis
top_20_beta <- meta_bvalues %>% arrange(pval) %>% head(20)
# Number of hits shared by all individuals case and beta meta-analysis
sum(top_20_complete$rs %in% top_20_beta$rs)
[1] 13
# Number of hits shared by all individuals case and fast GWAS
sum(top_20_complete$rs %in% top_20_fast$rs)
[1] 11
On that regard it seems the meta-analysis is performing better.
We now extract the results of interest from the meta-analysis and the complete data.
# Get data from all individuals study
original <- results_single %>% dplyr::rename(pval.original = p.value,
Est.original = Est,
SE.original = Est.SE) %>%
dplyr::select(c(rs, Est.original, SE.original, pval.original))
# Arrange dadta from beta meta-analysis
beta <- top_20_beta %>% dplyr::rename(pval.beta = pval,
Est.beta = b.F,
SE.beta = se.F) %>%
dplyr::select(c(rs, Est.beta, SE.beta, pval.beta))
# Arrange data from fast GWAS
fast <- top_20_fast %>% dplyr::rename(pval.fast = p.value,
Est.fast = Est,
SE.fast = Est.SE) %>%
dplyr::select(c(rs, Est.fast, SE.fast, pval.fast))
# Merge all data on a single table
total <- Reduce(function(x,y){inner_join(x,y, by = "rs")},
list(original, fast, beta))
With all the information gathered we can visualize the results on the Table 6.
Table 6: Beta values, standard errors and p-values yielded by the three methods: GWAS of all individuals, meta-analysis of synthetic cohorts, fast GWAS.
| SNP id | Beta | SE | P-Value | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Original | Fast | Meta | Original | Fast | Meta | Original | Fast | Meta | |
| rs7714862 | 0.376 | 0.374 | 0.376 | 0.081 | 0.082 | 0.084 | 3.2 × 10−6 | 5.0 × 10−6 | 8.0 × 10−6 |
| rs9722322 | 0.353 | 0.354 | 0.347 | 0.076 | 0.076 | 0.077 | 3.3 × 10−6 | 3.1 × 10−6 | 6.6 × 10−6 |
| rs12237062 | 0.497 | 0.507 | 0.497 | 0.108 | 0.106 | 0.111 | 4.3 × 10−6 | 1.7 × 10−6 | 8.0 × 10−6 |
| rs112352624 | 0.685 | 0.723 | 0.765 | 0.160 | 0.155 | 0.169 | 1.9 × 10−5 | 3.1 × 10−6 | 6.2 × 10−6 |
| rs956025 | −0.406 | −0.375 | −0.403 | 0.083 | 0.082 | 0.085 | 9.4 × 10−7 | 4.7 × 10−6 | 2.2 × 10−6 |
| rs12423318 | 0.722 | 0.697 | 0.716 | 0.154 | 0.147 | 0.158 | 2.8 × 10−6 | 2.2 × 10−6 | 5.5 × 10−6 |
| rs8015318 | 0.327 | 0.343 | 0.340 | 0.076 | 0.076 | 0.077 | 1.6 × 10−5 | 6.7 × 10−6 | 1.0 × 10−5 |
| rs12459712 | −0.512 | −0.511 | −0.533 | 0.096 | 0.094 | 0.099 | 9.0 × 10−8 | 6.0 × 10−8 | 7.1 × 10−8 |
| rs8130070 | 0.401 | 0.401 | 0.398 | 0.082 | 0.082 | 0.085 | 1.1 × 10−6 | 1.1 × 10−6 | 2.5 × 10−6 |
| rs9980256 | −0.392 | −0.377 | −0.388 | 0.082 | 0.083 | 0.085 | 2.0 × 10−6 | 6.0 × 10−6 | 5.2 × 10−6 |
Finally, to conclude this study we can compute the bias and mean squared error (code not shown). The results are shown on the Table 7.
Table 7: Mean squared error and bias for the betas from the meta-analysis and fast GWAS compared to the complete data (all individuals).
| MSE | Bias | |
|---|---|---|
| Beta | ||
| Fast | 3.6 × 10−4 | −8.6 × 10−3 |
| Meta | 7.2 × 10−4 | −6.6 × 10−3 |
On this brief comparison we’ve seen that the fast GWAS retrieves sensibly less SNPs from the ground truth but it has better performance of bias and MSE. It must be noted that the data used for this GWAS examples does not suffer from any cohort imbalance, so the meta-analysis is expected to yield proper results, on a different environment the fast GWAS is expected to exceed the performance of the meta-analysis in a more remarkable manner.
Up to this point we have analyzed the genotype and phenotype data by performing a genome wide association study. We can further study the yielded results by performing an enrichment analysis.
The GWAS results are a collection of SNPs that displayed a relationship with a given phenotype. Those SNPs may fall inside gene regions, and those genes can be functionally related. Studying this can help researchers understand the underlying biological processes. Those analysis are carried by querying the selected gene to a database, for this example we will be using the DisGeNET database.
The first step of the enrichment analysis is to annotate the genes that contain the SNPs of interest yielded by the GWAS. There are many ways to perform this step, we propose to use the biomaRt package and retrieve the annotation from the ensembl genome browser.
First we select the SNPs of interest. We have established a p.value threshold to do so.
snps_of_interest <- results_fast %>%
dplyr::arrange(p.value) %>%
dplyr::select(chr, pos, rs, p.value) %>%
dplyr::filter(p.value < 0.00005)
And with this list of SNPs we can move to the annotation.
library(biomaRt)
# Perform ensembl connection
gene.ensembl <- useEnsembl(biomart = "ensembl",
dataset = "hsapiens_gene_ensembl", GRCh = 37)
# Query SNP by SNP
genes_interest <- Reduce(rbind, apply(snps_of_interest, 1, function(x){
result <- getBM(
attributes = c('entrezgene_id', 'chromosome_name',
'ensembl_gene_id', 'start_position',
"end_position"),
filters = c('start', 'end', 'chromosome_name'),
values = list(start = x[2], end = x[2],
chromosome_name = x[1]),
mart = gene.ensembl)
}))
Having the genes of interest annotated, the final step is to query them to DisGeNET. We will use the function enrichDGN from the package DOSE for it. This package is part of a large ecosystem of packages dedicated to enrichment analysis, this allows for easy representation of the query results.
library(DOSE)
enrichment <- enrichDGN(unlist(genes_interest$entrezgene_id),
pvalueCutoff = 0.2)
Here we will portray some of the visualizations that can be performed with the query results. For more information and further visualization options please read the official vignette.
library(enrichplot)
barplot(enrichment)
dotplot(enrichment)
library(ggnewscale)
# convert gene ID to Symbol
edox <- setReadable(enrichment, 'org.Hs.eg.db', 'ENTREZID')
cnetplot(edox, foldChange=genes_interest$entrezgene_id)
|
⚠️ RESOURCES USED ALONG THIS SECTION |
|||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
From https://opal-demo.obiba.org/ : |
|||||||||||||||||||||
|
By checking for specific variants and quantifying them, a researcher can extract the polygenic risk score (PRS) of an individual, which translates to an associated risk of the individual versus a concrete phenotype. A definition of the term reads
“A weighted sum of the number of risk alleles carried by an individual, where the risk alleles and their weights are defined by the loci and their measured effects as detected by genome wide association studies.” (Extracted from Torkamani, Wineinger, and Topol (2018))
The use of PRS markers is very popular nowadays and it has been proved to be a good tool to asses associated risks Escott-Price et al. (2017), Forgetta et al. (2020), Kramer et al. (2020).
When calculating the PRS using DataSHIELD, the actual scores are not returned, as it would not be GDPR compliant to be able to determine the risk of an individual versus a certain disease. The actual results are stored on the servers and optionally can be merged to the phenotypes table. This allows further study of the PRS using summaries or fitting pooled GLM models to assess relationships between phenotypes and PRS scores.
The PRS use case will be set up as a multi-cohort, using the same dataset as the multi-cohort GWAS (same VCF resources). A single cohort PRS could be performed analogously using the single cohort example of the GWAS.
Figure 6.1: Proposed infrastructure to perform PRS studies.
There is a public repository of polygenic scores available at https://www.pgscatalog.org/, on this website there are 2155 different polygenic scores (January 2022) that have been extracted from curated literature. All those scores refer to 526 different traits. Due the relevance of this project, we made sure to include a way of sourcing it’s contents automatically.
There are two ways of stating the SNPs of interest and their weights in order to calculate the PRS.
Provide a PGS Catalog ‘Polygenic Score ID & Name.’ If this option is in use, the SNPs and beta weights will be downloaded from the PGS Catalog and will be used to calculate the PRS. (Example: PGS000008)
Providing a prs_table (SNPs of interest) table. The prs_table table has to have a defined structure and column names in order to be understood by this function. This table can be formulated using two schemas:
chr_name, chr_position, effect_allele, reference_allele, effect_weight.| chr_name | chr_position | effect_allele | reference_allele | effect_weight |
|---|---|---|---|---|
| 1 | 100880328 | T | A | 0.0373 |
| 1 | 121287994 | G | A | -0.0673 |
| 6 | 152023191 | A | G | 0.0626 |
| … | … | … | … | … |
rsID, effect_allele, reference_allele, effect_weight.| rdID | effect_allele | reference_allele | effect_weight |
|---|---|---|---|
| rs11206510 | T | C | 0.0769 |
| rs4773144 | G | A | 0.0676 |
| rs17514846 | A | C | 0.0487 |
| … | … | … | … |
The effect_weight have to be the betas (log(OR)).
Please note when using a custom prs_table table that it is much faster to use the Schema 1, as the subset of the VCF files is miles faster to do using chromosome name and positions rather than SNP id’s.
We have to create an Opal connection object to the cohort server. We do that using the following functions.
require('DSI')
require('DSOpal')
require('dsBaseClient')
require('dsOmicsClient')
builder <- DSI::newDSLoginBuilder()
builder$append(server = "cohort1", url = "https://opal-demo.obiba.org/",
user = "dsuser", password = "P@ssw0rd",
driver = "OpalDriver", profile = "omics")
builder$append(server = "cohort2", url = "https://opal-demo.obiba.org/",
user = "dsuser", password = "P@ssw0rd",
driver = "OpalDriver", profile = "omics")
builder$append(server = "cohort3", url = "https://opal-demo.obiba.org/",
user = "dsuser", password = "P@ssw0rd",
driver = "OpalDriver", profile = "omics")
logindata <- builder$build()
conns <- DSI::datashield.login(logins = logindata)
We assign the VCF and phenotypes resources.
# Cohort 1 resources
lapply(1:21, function(x){
DSI::datashield.assign.resource(conns[1], paste0("chr", x), paste0("GWAS.chr", x,"A"))
})
DSI::datashield.assign.resource(conns[1], "phenotypes_resource", "GWAS.ega_phenotypes_1")
# Cohort 2 resources
lapply(1:21, function(x){
DSI::datashield.assign.resource(conns[2], paste0("chr", x), paste0("GWAS.chr", x,"B"))
})
DSI::datashield.assign.resource(conns[2], "phenotypes_resource", "GWAS.ega_phenotypes_2")
# Cohort 3 resources
lapply(1:21, function(x){
DSI::datashield.assign.resource(conns[3], paste0("chr", x), paste0("GWAS.chr", x,"C"))
})
DSI::datashield.assign.resource(conns[3], "phenotypes_resource", "GWAS.ega_phenotypes_3")
DSI::datashield.assign.expr(conns = conns, symbol = "phenotypes",
expr = as.symbol("as.resource.data.frame(phenotypes_resource)"))
There is one crucial difference between this analysis and the GWAS, here we don’t need to resolve the resources. This is for memory efficiency purposes. When we calculate a PRS, we are only interested in a small selection of SNPs, for that reason the (potentially) huge VCF files will be trimmed before being resolved, so only the relevant SNPs are loaded in the memory. For that reason the resolving of the resources is delegated to the ds.PRS function, which will perform this trimming according to the SNPs of interest yielded by the PGS catalog query or the prs_table table. [CAMBIAR]
Finally the resources are resolved to retrieve the data in the remote sessions.
lapply(1:21, function(x){
DSI::datashield.assign.expr(conns = conns, symbol = paste0("gds", x, "_object"),
expr = as.symbol(paste0("as.resource.object(chr", x, ")")))
})
The PRS results will be added to the phenotypes table to be later studied using a pooled GLM. To do that, the following arguments of the function ds.PRS will be used:
table: Name of the table where the results will be merged.table_id_column: Number of the column in the table table that contains the individuals ID to merge the results from the PRS.table_prs_name: Name to append to the PRS results. It will take the structure prs_table_prs_name and prs_nw_table_prs_name.For this example we will calculate the PRS associated to the trait ‘HDL Cholesterol’ and ‘Thyroid cancer,’ which corresponds to the PGS ID PGS000660 and PGS000636
resources <- paste0("gds", 1:21, "_object")
# HDL cholesterol
ds.PRS(resources = resources, pgs_id = "PGS000660",
table = "phenotypes", table_id_column = "subject_id")
# Thyroid cancer
ds.PRS(resources = resources, pgs_id = "PGS000636",
table = "phenotypes", table_id_column = "subject_id")
We can check that the results have been correctly added to the phenotypes table.
tail(ds.colnames('phenotypes')[[1]])
[1] "prs_PGS000660" "prs_nw_PGS000660" "n_snps_PGS000660" "prs_PGS000636"
[5] "prs_nw_PGS000636" "n_snps_PGS000636"
prs_table exampleHere we will illustrate a use case with a custom prs_table, please note that the proposed SNPs of interest and betas are totally arbitrary.
We will first assemble the prs_table table following the indications found on the Details section of ?ds.PRS (Schema 1).
prs_table <- data.frame(chr_name = c(9, 9, 9, 9, 1, 1, 2, 2),
chr_position = c(177034, 161674, 140859, 119681,
60351, 825410, 154629, 145710),
effect_allele = c("T", "C", "G", "C", "G", "G", "T", "A"),
reference_allele = c("G", "T", "A", "T", "A", "A", "G", "G"),
effect_weight = c(0.5, 0.2, 0.1, 0.2, 0.1, 0.3, 0.1, 0.8))
Now we can pass the custom prs_table to the PRS function to retrieve the aggregated results.
resources <- paste0("gds", 1:21, "_object")
ds.PRS(resources = resources, prs_table = prs_table,
table = "phenotypes", table_id_column = "subject_id",
table_prs_name = "custom_prs_table_results")
We can fit pooled GLM models using the function ds.glm. In this use case we will fit a model to relate the HDL.cholesterol to prs_PGS000660 adjusted by sex. The same will be done with the prs_PGS000636.
ds.glm(formula = "hdl_cholesterol ~ prs_PGS000660 + sex",
family = "gaussian", data = "phenotypes")$coefficients
Estimate Std. Error z-value p-value low0.95CI
(Intercept) 1.576448314 0.178438076 8.8347081 1.003600e-18 1.22671611
prs_PGS000660 -0.003678982 0.009899804 -0.3716217 7.101746e-01 -0.02308224
sexmale 0.002311580 0.016140511 0.1432161 8.861195e-01 -0.02932324
high0.95CI
(Intercept) 1.92618052
prs_PGS000660 0.01572428
sexmale 0.03394640
ds.glm(formula = "hdl_cholesterol ~ prs_PGS000636 + sex",
family = "gaussian", data = "phenotypes")$coefficients
Estimate Std. Error z-value p-value low0.95CI
(Intercept) 1.522002437 0.01614425 94.2751901 0.0000000 1.49036028
prs_PGS000636 -0.066872228 0.06553929 -1.0203380 0.3075682 -0.19532688
sexmale 0.002416221 0.01613787 0.1497237 0.8809826 -0.02921342
high0.95CI
(Intercept) 1.55364459
prs_PGS000636 0.06158242
sexmale 0.03404586
datashield.logout(conns)
|
⚠️ RESOURCES USED ALONG THIS SECTION |
||||||
|---|---|---|---|---|---|---|
|
From https://opal-demo.obiba.org/ : |
||||||
|
In this section we will illustrate how to perform transcriptomic data analysis using data from TCGA project (section 3.3). We have uploaded to the demo Opal server a resource called tcga_liver whose URL is http://duffel.rail.bio/recount/TCGA/rse_gene_liver.Rdata which is available through the recount project. The resource used on this section contains a RangedSummarizedExperimentwith the RNAseq profiling of liver cancer data from TCGA.
The resource is located inside the OMICS project. Some summaries of the dataset are the following:
| TCGA Liver data | |
|---|---|
| Number of individuals | 424 |
| Number of genes | 58,037 |
| Number of covariate fields | 864 |
The structure used is illustrated on the following figure.
Figure 7.1: Proposed infrastructure to perform DGE studies.
We illustrate a differential expression analysis to compare RNAseq profiling of women vs men (variable gdc_cases.demographic.gender). The DGE analysis is normally performed using limma package. In that case, as we are analyzing RNA-seq data, limma + voom method will be required.
The following use cases will be illustrated:
We have to create an Opal connection object to the cohort server. We do that using the following functions.
require('DSI')
require('DSOpal')
require('dsBaseClient')
require('dsOmicsClient')
builder <- DSI::newDSLoginBuilder()
builder$append(server = "cohort1", url = "https://opal-demo.obiba.org/",
user = "dsuser", password = "P@ssw0rd",
driver = "OpalDriver", profile = "omics")
logindata <- builder$build()
conns <- DSI::datashield.login(logins = logindata)
Now that we have created a connection object to the Opal, we have started a new R session on the server, our analysis will take place in this remote session, so we have to load the data into it.
In this use case we will use one resource from the OMICS project hosted on the demo Opal server. This resources correspond to RangedsummarizedExperiment dataset. The names of the resource is tcga_liver, we will refer to it using the string OMICS.tcga_liver.
DSI::datashield.assign.resource(conns, "rse_resource", "OMICS.tcga_liver")
Now we have assigned the resource named OMICS.tcga_liver into our remote R session. We have assigned it to a variable called rse_resource. To verify this step has been performed correctly, we could use the ds.class function to check for their class and that they exist on the remote sessions.
ds.class("rse_resource")
$cohort1
[1] "RDataFileResourceClient" "FileResourceClient"
[3] "ResourceClient" "R6"
We can see that the object rse_resource exists in the server.
Finally the resource is resolved to retrieve the data in the remote session.
DSI::datashield.assign.expr(conns = conns, symbol = "rse",
expr = as.symbol("as.resource.object(rse_resource)"))
Now we have resolved the resource named rse_resource into our remote R session. The object retrieved has been assigned into the variable named rse. We can check the process was successful as we did before.
ds.class("rse")
$cohort1
[1] "RangedSummarizedExperiment"
attr(,"package")
[1] "SummarizedExperiment"
The number of features and samples can be inspected with:
ds.dim("rse")
$`dimensions of rse in cohort1`
[1] 58037 424
$`dimensions of rse in combined studies`
[1] 58037 424
And the names of the features using the same function used in the case of analyzing an ExpressionSet:
name.features <- ds.featureNames("rse")
lapply(name.features, head)
$cohort1
[1] "ENSG00000000003.14" "ENSG00000000005.5" "ENSG00000000419.12"
[4] "ENSG00000000457.13" "ENSG00000000460.16" "ENSG00000000938.12"
Also the covariate names can be inspected by:
name.vars <- ds.featureData("rse")
lapply(name.vars, head, n=15)
$cohort1
[1] "project"
[2] "sample"
[3] "experiment"
[4] "run"
[5] "read_count_as_reported_by_sra"
[6] "reads_downloaded"
[7] "proportion_of_reads_reported_by_sra_downloaded"
[8] "paired_end"
[9] "sra_misreported_paired_end"
[10] "mapped_read_count"
[11] "auc"
[12] "sharq_beta_tissue"
[13] "sharq_beta_cell_type"
[14] "biosample_submission_date"
[15] "biosample_publication_date"
We can visualize the levels of the variable having gender information that will be our condition (i.e., we are interested in obtaining genes that are differentially expressed between males and females).
ds.table1D("rse$gdc_cases.demographic.gender")
$counts
rse$gdc_cases.demographic.gender
female 143
male 281
Total 424
$percentages
rse$gdc_cases.demographic.gender
female 33.73
male 66.27
Total 100.00
$validity
[1] "All tables are valid!"
We have implemented a function called ds.RNAseqPreproc() to perform RNAseq data pre-processing that includes:
ds.RNAseqPreproc('rse', group = 'gdc_cases.demographic.gender',
newobj.name = 'rse.pre')
Note that it is recommended to indicate the grouping variable (i.e., condition). Once data has been pre-processed, we can perform differential expression analysis. Notice how dimensions have changed given the fact that we have removed genes with low expression which are expected to do not be differentially expressed.
ds.dim('rse')
$`dimensions of rse in cohort1`
[1] 58037 424
$`dimensions of rse in combined studies`
[1] 58037 424
ds.dim('rse.pre')
$`dimensions of rse.pre in cohort1`
[1] 40363 424
$`dimensions of rse.pre in combined studies`
[1] 40363 424
The differential expression analysis in dsOmicsClient/dsOmics is implemented in the funcion ds.limma(). This functions runs a limma-pipeline for microarray data and for RNAseq data allows:
We recommend to use the voom + limma pipeline proposed here given its versatility and that limma is much faster than DESeq2 and edgeR. By default, the function considers that data is obtained from a microarray experiment (type.data = "microarray"). Therefore, as we are analyzing RNAseq data, we must indicate that type.data = "RNAse".
ans.gender <- ds.limma(model = ~ gdc_cases.demographic.gender,
Set = "rse.pre", type.data = "RNAseq")
The top differentially expressed genes can be visualized by:
ans.gender
$cohort1
# A tibble: 40,363 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000274655.1 424 -12.4 0.0761 -52.1 2.74e-187 1.11e-182
2 ENSG00000270641.1 424 -10.2 0.461 -43.8 5.21e-160 1.05e-155
3 ENSG00000229807.10 424 -11.0 0.0603 -39.5 1.08e-144 1.45e-140
4 ENSG00000277577.1 424 -11.3 0.0651 -36.0 2.27e-131 2.29e-127
5 ENSG00000233070.1 424 10.9 0.0885 35.5 1.85e-129 1.49e-125
6 ENSG00000260197.1 424 10.2 0.118 32.9 3.72e-119 2.50e-115
7 ENSG00000213318.4 424 11.4 0.128 31.9 5.57e-115 3.21e-111
8 ENSG00000278039.1 424 -7.78 0.0812 -28.8 3.85e-102 1.94e- 98
9 ENSG00000067048.16 424 9.62 0.0894 27.4 4.72e- 96 2.12e- 92
10 ENSG00000131002.11 424 11.4 0.0924 27.3 9.63e- 96 3.89e- 92
# ... with 40,353 more rows
attr(,"class")
[1] "dsLimma" "list"
We can verify whether the distribution of the observed p-values are the ones we expect in this type of analyses:
hist(ans.gender$cohort1$P.Value, xlab="Raw p-value gender effect",
main="", las=1, cex.lab=1.5, cex.axis=1.2, col="gray")
We can also check whether there is inflation just executing
qqplot(ans.gender$cohort1$P.Value)
So, in that case, the model needs to remove unwanted variability (\(\lambda > 2\)). If so, we can use surrogate variable analysis just changing the argument sva=TRUE.
ans.gender.sva <- ds.limma(model = ~ gdc_cases.demographic.gender,
Set = "rse.pre", type.data = "RNAseq",
sva = TRUE)
Now the inflation has dramatically been reduced (\(\lambda > 1.12\)).
qqplot(ans.gender.sva$cohort1$P.Value)
We can add annotation to the output that is available in our RSE object. We can have access to this information by:
ds.fvarLabels('rse.pre')
$cohort1
[1] "chromosome" "start" "end" "width" "strand"
[6] "gene_id" "bp_length" "symbol"
attr(,"class")
[1] "dsfvarLabels" "list"
So, we can run:
ans.gender.sva <- ds.limma(model = ~ gdc_cases.demographic.gender,
Set = "rse.pre", type.data = "RNAseq",
sva = TRUE, annotCols = c("chromosome"))
The results are:
ans.gender.sva
$cohort1
# A tibble: 40,363 x 8
id n beta SE t P.Value adj.P.Val chromosome
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 ENSG00000274655.1 424 -12.4 0.0761 -52.1 2.74e-187 1.11e-182 chrX
2 ENSG00000270641.1 424 -10.2 0.461 -43.8 5.21e-160 1.05e-155 chrX
3 ENSG00000229807.10 424 -11.0 0.0603 -39.5 1.08e-144 1.45e-140 chrX
4 ENSG00000277577.1 424 -11.3 0.0651 -36.0 2.27e-131 2.29e-127 chrX
5 ENSG00000233070.1 424 10.9 0.0885 35.5 1.85e-129 1.49e-125 chrY
6 ENSG00000260197.1 424 10.2 0.118 32.9 3.72e-119 2.50e-115 chrY
7 ENSG00000213318.4 424 11.4 0.128 31.9 5.57e-115 3.21e-111 chr16
8 ENSG00000278039.1 424 -7.78 0.0812 -28.8 3.85e-102 1.94e- 98 chrX
9 ENSG00000067048.16 424 9.62 0.0894 27.4 4.72e- 96 2.12e- 92 chrY
10 ENSG00000131002.11 424 11.4 0.0924 27.3 9.63e- 96 3.89e- 92 chrY
# ... with 40,353 more rows
attr(,"class")
[1] "dsLimma" "list"
The function has another arguments that can be used to fit other type of models:
sva: estimate surrogate variablesannotCols: to add annotation available in themethod: Linear regression (“ls”) or robust regression (“robust”) used in limma (lmFit)robust: robust method used for outlier sample variances used in limma (eBayes)normalization: normalization method used in the voom transformation (default “none”)voomQualityWeights: should voomQualityWeights function be used instead of voom? (default FALSE)big: should SmartSVA be used instead of SVA (useful for big sample size or when analyzing epigenome data. Default FALSE)We have also implemented two other functions ds.DESeq2 and ds.edgeR that perform DGE analysis using DESeq2 and edgeR methods. This is the R code used to that purpose:
To be supplied
We close the DataSHIELD session by:
datashield.logout(conns)
|
⚠️ RESOURCES USED ALONG THIS SECTION |
|||||||||
|---|---|---|---|---|---|---|---|---|---|
|
From https://opal-demo.obiba.org/ : |
|||||||||
|
In this section we will illustrate how to perform an epigenome-wide association analysis (EWAS) using methylation data. EWAS requires basically the same statistical methods as those used in DGE. It should be noticed that the pooled analysis we are going to illustrate here can also be performed with transcriptomic data, however the data needs to be harmonized beforehand to ensure each study has the same range values. For EWAS where methylation is measured using beta values (e.g CpG data are in the range 0-1) this is not a problem. In any case, adopting the meta-analysis approach could be a safe option without the need of harmonization.
Moreover, we encourage to perform pooled analysis only on the significant hits obtained by the meta-analysis, since it is a much slower methodology.
The data used in this section has been downloaded from GEO (accession number GSE66351) which contains DNA methylation profiling (Illumina 450K array) (section 3.4). Data corresponds to CpGs beta values measured in the superior temporal gyrus and prefrontal cortex brain regions of patients with Alzheimer’s.
This kind of data is encapsulated on a type of R object called ExpressionSet, this objects are part of the BioConductor project and are meant to contain different sources of genomic data, alongside the genomic data they can also contain the phenotypes and metadata associated to a study. Researchers who are not familiar with ExpressionSet can find further information here.
The data that we will be using along this section is an ExpressionSet objects. Notice that genomic data is encoded as beta-values that ensure data harmonization across studies.
In order to illustrate how to perform data analyses using federated data, we have split the data into two synthetic cohorts (split by individuals). We have created two resources on the demo Opal server called GSE66351_1 and GSE66351_2 respectively. They can be found inside the OMICS project. Some summaries of the datasets are the following:
| Cohort 1 | Cohort 2 | Total | |
|---|---|---|---|
| Number of CpGs | 481,868 | 481,868 | 481,868 |
| Number of individuals | 100 | 90 | 190 |
| Number of covariates | 49 | 49 | 49 |
| Number of annotation features | 37 | 37 | 37 |
The structure used is illustrated on the following figure.
Figure 8.1: Proposed infrastructure to perform EWAS studies.
The data analyst corresponds to the “RStudio” session, which through DataSHIELD Interface (DSI) connects with the Opal server. The Opal servers contain a resource that correspond to the GEO:GSE66351 ExpressionSet (subseted by individuals).
We will illustrate the following use cases:
We have to create an Opal connection object to the different cohorts server. We do that using the following functions.
require('DSI')
require('DSOpal')
require('dsBaseClient')
require('dsOmicsClient')
builder <- DSI::newDSLoginBuilder()
builder$append(server = "cohort1", url = "https://opal-demo.obiba.org/",
user = "dsuser", password = "P@ssw0rd",
driver = "OpalDriver", profile = "omics")
builder$append(server = "cohort2", url = "https://opal-demo.obiba.org/",
user = "dsuser", password = "P@ssw0rd",
driver = "OpalDriver", profile = "omics")
logindata <- builder$build()
conns <- DSI::datashield.login(logins = logindata)
It is important to note that in this use case, we are only using one server (https://opal-demo.obiba.org/), on this server there are all the resources that correspond to the different cohorts. On a more real scenario each one of the builder$append instructions would be connecting to a different server.
ExpressionSet resourceNow that we have created a connection object to the different Opals, we have started two R sessions, our analysis will take place on those remote sessions, so we have to load the data into them.
In this use case we will use 2 different resources from the OMICS project hosted on the demo Opal server. The names of the resources are GSE66351_X (where X is the cohort identifier 1/2). Following the Opal syntax, we will refer to them using the string OMICS.GSE66351_X.
We have to refer specifically to each different server by using conns[X], this allows us to communicate with the server of interest to indicate to it the resource that it has to load.
# Cohort 1 resource load
DSI::datashield.assign.resource(conns[1], "eSet_resource", "OMICS.GSE66351_1")
# Cohort 2 resource load
DSI::datashield.assign.resource(conns[2], "eSet_resource", "OMICS.GSE66351_2")
Now we have assigned all the resources named into our remote R sessions. We have assigned them to the variables called eSet_resource. To verify this step has been performed correctly, we can use the ds.class function to check for their class and that they exist on the remote sessions.
ds.class("eSet_resource")
$cohort1
[1] "RDataFileResourceClient" "FileResourceClient"
[3] "ResourceClient" "R6"
$cohort2
[1] "RDataFileResourceClient" "FileResourceClient"
[3] "ResourceClient" "R6"
We can see that the object eSet_resource exists in both servers.
Finally the resource is resolved to retrieve the data in the remote sessions.
DSI::datashield.assign.expr(conns = conns, symbol = "eSet",
expr = as.symbol("as.resource.object(eSet_resource)"))
Now we have resolved the resource named eSet_resource into our remote R session. The object retrieved has been assigned into the variable named eSet. We can check the process was successful as we did before.
ds.class("eSet")
$cohort1
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
$cohort2
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
ExpressionSetFeature names can be returned by:
fn <- ds.featureNames("eSet")
lapply(fn, head)
$cohort1
[1] "cg00000029" "cg00000108" "cg00000109" "cg00000165" "cg00000236"
[6] "cg00000289"
$cohort2
[1] "cg00000029" "cg00000108" "cg00000109" "cg00000165" "cg00000236"
[6] "cg00000289"
Experimental phenotypes variables can be obtained by:
fn <- ds.varLabels("eSet")
lapply(fn, head)
$cohort1
[1] "title" "geo_accession" "status" "submission_date"
[5] "last_update_date" "type"
$cohort2
[1] "title" "geo_accession" "status" "submission_date"
[5] "last_update_date" "type"
The columns of the annotation can be obtained by:
fn <- ds.fvarLabels("eSet")
lapply(fn, head)
$cohort1
[1] "ID" "Name" "AddressA_ID" "AlleleA_ProbeSeq"
[5] "AddressB_ID" "AlleleB_ProbeSeq"
$cohort2
[1] "ID" "Name" "AddressA_ID" "AlleleA_ProbeSeq"
[5] "AddressB_ID" "AlleleB_ProbeSeq"
Once the methylation data have been loaded into the opal server, we can perform different type of analyses using functions from the dsOmicsClient package. Let us start by illustrating how to analyze a single CpG from two cohorts by using an approach that is mathematically equivalent to placing all individual-level (pooled).
ans <- ds.lmFeature(feature = "cg07363416",
model = ~ diagnosis + Sex,
Set = "eSet",
datasources = conns)
ans
Estimate Std. Error p-value
cg07363416 0.03459886 0.02504291 0.1670998
attr(,"class")
[1] "dsLmFeature" "matrix" "array"
The same analysis can be performed for multiple features. This process can be parallelized using mclapply function from the multicore package (only works on GNU/Linux, not on Windows).
ans <- ds.lmFeature(feature = c("cg00000029", "cg00000108", "cg00000109", "cg00000165"),
model = ~ diagnosis + Sex,
Set = "eSet",
datasources = conns,
mc.cores = 20)
If the feature argument is not supplied, all the features will be analyzed, please note that this process can be extremely slow if there is a huge number of features; for example, on the case we are illustrating we have over 400K features, so this process would take too much time.
If all the features are to be studied, we recommend switching to meta-analysis methods. More information on the next section.
We can adopt another strategy that is to run a glm of each feature independently at each study using limma package (which is really fast) and then combine the results (i.e. meta-analysis approach).
ans.limma <- ds.limma(model = ~ diagnosis + Sex,
Set = "eSet",
datasources = conns)
Then, we can visualize the top genes at each study (i.e server) by:
lapply(ans.limma, head)
$cohort1
# A tibble: 6 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg13138089 100 -0.147 0.0122 -6.62 0.00000000190 0.000466
2 cg23859635 100 -0.0569 0.00520 -6.58 0.00000000232 0.000466
3 cg13772815 100 -0.0820 0.0135 -6.50 0.00000000327 0.000466
4 cg12706938 100 -0.0519 0.00872 -6.45 0.00000000425 0.000466
5 cg24724506 100 -0.0452 0.00775 -6.39 0.00000000547 0.000466
6 cg02812891 100 -0.125 0.0163 -6.33 0.00000000731 0.000466
$cohort2
# A tibble: 6 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg04046629 90 -0.101 0.0128 -5.91 0.0000000621 0.0172
2 cg07664323 90 -0.0431 0.00390 -5.85 0.0000000822 0.0172
3 cg27098804 90 -0.0688 0.0147 -5.79 0.000000107 0.0172
4 cg08933615 90 -0.0461 0.00791 -5.55 0.000000298 0.0360
5 cg18349298 90 -0.0491 0.00848 -5.42 0.000000507 0.0489
6 cg02182795 90 -0.0199 0.0155 -5.36 0.000000670 0.0538
The annotation can be added by using the argument annotCols. It should be a vector with the columns of the annotation available in the ExpressionSet or RangedSummarizedExperiment that want to be showed. To obtain the available annotation columns revisit #ewas-inspect.
ans.limma.annot <- ds.limma(model = ~ diagnosis + Sex,
Set = "eSet",
annotCols = c("CHR", "UCSC_RefGene_Name"),
datasources = conns)
lapply(ans.limma.annot, head)
$cohort1
# A tibble: 6 x 9
id n beta SE t P.Value adj.P.Val CHR UCSC_RefGene_Na~
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 cg1313~ 100 -0.147 0.0122 -6.62 1.90e-9 0.000466 2 "ECEL1P2"
2 cg2385~ 100 -0.0569 0.00520 -6.58 2.32e-9 0.000466 2 "MTA3"
3 cg1377~ 100 -0.0820 0.0135 -6.50 3.27e-9 0.000466 17 ""
4 cg1270~ 100 -0.0519 0.00872 -6.45 4.25e-9 0.000466 19 "MEX3D"
5 cg2472~ 100 -0.0452 0.00775 -6.39 5.47e-9 0.000466 19 "ISOC2;ISOC2;IS~
6 cg0281~ 100 -0.125 0.0163 -6.33 7.31e-9 0.000466 2 "ECEL1P2"
$cohort2
# A tibble: 6 x 9
id n beta SE t P.Value adj.P.Val CHR UCSC_RefGene_Na~
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 cg0404~ 90 -0.101 0.0128 -5.91 6.21e-8 0.0172 11 "CD6"
2 cg0766~ 90 -0.0431 0.00390 -5.85 8.22e-8 0.0172 6 "MUC21"
3 cg2709~ 90 -0.0688 0.0147 -5.79 1.07e-7 0.0172 11 "CD6"
4 cg0893~ 90 -0.0461 0.00791 -5.55 2.98e-7 0.0360 1 ""
5 cg1834~ 90 -0.0491 0.00848 -5.42 5.07e-7 0.0489 3 "RARRES1;RARRES~
6 cg0218~ 90 -0.0199 0.0155 -5.36 6.70e-7 0.0538 8 ""
Then, the last step is to meta-analyze the results. Different methods can be used to this end. We have implemented a method that meta-analyze the p-pvalues of each study as follows:
ans.meta <- metaPvalues(ans.limma)
ans.meta
# A tibble: 481,868 x 4
id cohort1 cohort2 p.meta
<chr> <dbl> <dbl> <dbl>
1 cg13138089 0.00000000190 0.00000763 4.78e-13
2 cg25317941 0.0000000179 0.00000196 1.12e-12
3 cg02812891 0.00000000731 0.00000707 1.63e-12
4 cg12706938 0.00000000425 0.0000161 2.14e-12
5 cg16026647 0.000000101 0.000000797 2.51e-12
6 cg12695465 0.00000000985 0.0000144 4.33e-12
7 cg21171625 0.000000146 0.00000225 9.78e-12
8 cg13772815 0.00000000327 0.000122 1.18e-11
9 cg00228891 0.000000166 0.00000283 1.38e-11
10 cg21488617 0.0000000186 0.0000299 1.62e-11
# ... with 481,858 more rows
We can verify that the results are pretty similar to those obtained using pooled analyses. Here we compute the association for the top two CpGs:
res <- ds.lmFeature(feature = ans.meta$id[1:2],
model = ~ diagnosis + Sex,
Set = "eSet",
datasources = conns)
res
Estimate Std. Error p-value p.adj
cg25317941 -0.03452376 0.004303781 1.042679e-15 1.057482e-15
cg13138089 -0.13733479 0.017124046 1.057482e-15 1.057482e-15
attr(,"class")
[1] "dsLmFeature" "matrix" "array"
We can create a QQ-plot by using the function qqplot available in our package.
qqplot(ans.meta$p.meta)
Here In some cases inflation can be observed, so that, correction for cell-type or surrogate variables must be performed. We describe how we can do that in the next two sections.
The vast majority of omic studies require to control for unwanted variability. The surrogate variable analysis (SVA) can address this issue by estimating some hidden covariates that capture differences across individuals due to some artifacts such as batch effects or sample quality among others. The method is implemented in SVA package.
Performing this type of analysis using the ds.lmFeature function is not allowed since estimating SVA would require to implement a non-disclosive method that computes SVA from the different servers. This will be a future topic of the dsOmicsClient. For that reason we have to adopt a compromise solution which is to perform the SVA independently at each study. We use the ds.limma function to perform the analyses adjusted for SVA at each study.
ans.sva <- ds.limma(model = ~ diagnosis + Sex,
Set = "eSet",
sva = TRUE, annotCols = c("CHR", "UCSC_RefGene_Name"))
ans.sva
$cohort1
# A tibble: 481,868 x 9
id n beta SE t P.Value adj.P.Val CHR UCSC_RefGene_Na~
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 cg1313~ 100 -0.147 0.0122 -6.62 1.90e-9 0.000466 2 "ECEL1P2"
2 cg2385~ 100 -0.0569 0.00520 -6.58 2.32e-9 0.000466 2 "MTA3"
3 cg1377~ 100 -0.0820 0.0135 -6.50 3.27e-9 0.000466 17 ""
4 cg1270~ 100 -0.0519 0.00872 -6.45 4.25e-9 0.000466 19 "MEX3D"
5 cg2472~ 100 -0.0452 0.00775 -6.39 5.47e-9 0.000466 19 "ISOC2;ISOC2;IS~
6 cg0281~ 100 -0.125 0.0163 -6.33 7.31e-9 0.000466 2 "ECEL1P2"
7 cg2766~ 100 -0.0588 0.0198 -6.33 7.48e-9 0.000466 16 "ANKRD11"
8 cg1537~ 100 -0.0709 0.0115 -6.32 7.83e-9 0.000466 2 "LPIN1"
9 cg1552~ 100 -0.0446 0.00750 -6.29 8.69e-9 0.000466 10 ""
10 cg1269~ 100 -0.0497 0.00155 -6.27 9.85e-9 0.000475 6 "GCNT2;GCNT2;GC~
# ... with 481,858 more rows
$cohort2
# A tibble: 481,868 x 9
id n beta SE t P.Value adj.P.Val CHR UCSC_RefGene_Name
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 cg040~ 90 -0.101 0.0128 -5.91 6.21e-8 0.0172 11 "CD6"
2 cg076~ 90 -0.0431 0.00390 -5.85 8.22e-8 0.0172 6 "MUC21"
3 cg270~ 90 -0.0688 0.0147 -5.79 1.07e-7 0.0172 11 "CD6"
4 cg089~ 90 -0.0461 0.00791 -5.55 2.98e-7 0.0360 1 ""
5 cg183~ 90 -0.0491 0.00848 -5.42 5.07e-7 0.0489 3 "RARRES1;RARRES1"
6 cg021~ 90 -0.0199 0.0155 -5.36 6.70e-7 0.0538 8 ""
7 cg160~ 90 -0.0531 0.0196 -5.31 7.97e-7 0.0548 17 "MEIS3P1"
8 cg012~ 90 -0.0537 0.00971 -5.18 1.39e-6 0.0582 7 "HOXA2"
9 cg251~ 90 -0.0224 0.00736 -5.15 1.57e-6 0.0582 3 "ZBTB20;ZBTB20;ZB~
10 cg074~ 90 -0.0475 0.00166 -5.13 1.67e-6 0.0582 22 "C22orf27"
# ... with 481,858 more rows
attr(,"class")
[1] "dsLimma" "list"
Then, data can be combined meta-anlyzed as follows:
ans.meta.sv <- metaPvalues(ans.sva)
ans.meta.sv
# A tibble: 481,868 x 4
id cohort1 cohort2 p.meta
<chr> <dbl> <dbl> <dbl>
1 cg13138089 0.00000000190 0.00000763 4.78e-13
2 cg25317941 0.0000000179 0.00000196 1.12e-12
3 cg02812891 0.00000000731 0.00000707 1.63e-12
4 cg12706938 0.00000000425 0.0000161 2.14e-12
5 cg16026647 0.000000101 0.000000797 2.51e-12
6 cg12695465 0.00000000985 0.0000144 4.33e-12
7 cg21171625 0.000000146 0.00000225 9.78e-12
8 cg13772815 0.00000000327 0.000122 1.18e-11
9 cg00228891 0.000000166 0.00000283 1.38e-11
10 cg21488617 0.0000000186 0.0000299 1.62e-11
# ... with 481,858 more rows
And we can revisit the qqplot:
qqplot(ans.meta.sv$p.meta)
The DataSHIELD session must be closed by:
datashield.logout(conns)
In this section, we will illustrate how to perform gene expression analysis using data from the HELIX project. Particularly, we will analyse microarray data deriving from the Human Transcriptome Array 2.0 Affymetrix platform. On these data, we will show how to perform a differential gene expression analysis comparing the transcriptomic profiles of boys and girls in the HELIX cohorts.
The microarray data to be analysed has been previously pre-processed (e.g., normalized, low expression or cross-reactive probes removed, etc.), so, this showcase will be restricted to the analysis part. For that, we will employ the dsOmicsClient functions that adapt some of the most important “limma” methods for their use in dataSHIELD. Please, note that pre-processing of gene expression data is also included in dsOmicsClient as indicated in our bookdown. For this tutorial, data are hosted in the Opal BRGE site simulating a single-site DataSHIELD architecture, as previously described in the section 3.5.1.1.
In this section, we will describe how to configure R and DataSHIELD with the needed packages to carry out proposed analyses in remote. We start by installing the client-side version of the following DataSHIELD/Opal integration packages.
install.packages("DSOpal", dependencies=TRUE)
install.packages("DSI", dependencies=TRUE)
Make sure you also install the DataSHIELD client-side version of the package dsBaseClient.
install.packages("dsBaseClient",
repos = c("http://cran.datashield.org","https://cloud.r-project.org/"),
dependencies = TRUE)
Then, install the client-side version of the dsOmicsClient package directly from GitHub.
install.packages("devtools")
require("devtools")
devtools::install_github("isglobal-brge/dsOmicsClient",ref="master",
auth_token = "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",force=TRUE)
Once installations are completed, all the packages are loaded as usual.
require(DSOpal)
require(DSI)
require(dsBaseClient)
require(dsOmicsClient)
# Loading additional required packages (if not installed, you can easly install them using the BiocManager::install() function)
require(clusterProfiler)
require(org.Hs.eg.db)
require(ggplot2)
require(biomaRt)
In this section, we will cover how to load and inspect input microarray data with DataSHIELD. We start by creating the connection to the opal server using an user who have DataSHIELD permissions.
builder <- DSI::newDSLoginBuilder()
builder$append(server = "BIB", url = "https://datashield.isglobal.org/repo",
user = "invited", password = "12345678",
profile = "rock-inma")
builder$append(server = "EDEN", url = "https://datashield.isglobal.org/repo",
user = "invited", password = "12345678",
profile = "rock-inma")
builder$append(server = "KANC", url = "https://datashield.isglobal.org/repo",
user = "invited", password = "12345678",
profile = "rock-inma")
builder$append(server = "MoBA", url = "https://datashield.isglobal.org/repo",
user = "invited", password = "12345678",
profile = "rock-inma")
builder$append(server = "Rhea", url = "https://datashield.isglobal.org/repo",
user = "invited", password = "12345678",
profile = "rock-inma")
builder$append(server = "INMASAB", url = "https://datashield.isglobal.org/repo",
user = "invited", password = "12345678",
profile = "rock-inma")
logindata <- builder$build()
conns <- DSI::datashield.login(logins = logindata)
Now that we have created a connection object to the Opal, we have started a new R session on the server, and our analysis will take place in this remote session, so we have to load the data into it. In this use case, available data are in the form ExpressionSet files. ExpressionSet is a file-format of the BioConductor project that may contain different sources of genomic data. Alongside the genomic data, ExpressionSet may also contain some phenotypes and metadata associated to the population. Researchers who are not familiar with ExpressionSet can find further information here.
For our showcase, one ExpressionSet file per cohort is hosted in the Opal BRGE site in the form of a resource, all organized within the Opal server in a project called “HELIX_omics”.
For loading these data into the remote DataSHIELD session we will use the DSI::datashield.assign.resource() function. This function takes the connections to the server created in the previous code chunk to assign all available resource objects from a project in the Opal to an R object in the DataSHIELD remote session. As it can be seen in the code, resources in DataSHIELD are called in the function following the next structure “NameOfOpalProject.NameOfResource”.
# We assign microarray data from all cohorts to an object called resource_pos
DSI::datashield.assign.resource(conns[1], "resource_expr",
"HELIX_omics.genexpr_BIB")
DSI::datashield.assign.resource(conns[2], "resource_expr",
"HELIX_omics.genexpr_EDE")
DSI::datashield.assign.resource(conns[3], "resource_expr",
"HELIX_omics.genexpr_KAN")
DSI::datashield.assign.resource(conns[4], "resource_expr",
"HELIX_omics.genexpr_MOB")
DSI::datashield.assign.resource(conns[5], "resource_expr",
"HELIX_omics.genexpr_RHE")
DSI::datashield.assign.resource(conns[6], "resource_expr",
"HELIX_omics.genexpr_SAB")
With these instructions, we have assigned the resources into our remote R session to a variable called “resource_expr”. To verify this step has been performed correctly, we could use the ds.class() function to check for their class and that they exist on the remote session. Then, we have to resolve the resource and retrieve the data in the remote session. For that, we will use the DSI::datashield.assign.expr() function.
# We resolve the resource
DSI::datashield.assign.expr(conns = conns, symbol = "resource_expr_HELIX",
expr = as.symbol("as.resource.object(resource_expr)"))
As a result, we will get an R object (here named “resource_expr_HELIX”) containing the available “ExpressionSets” files for cohorts. For created objects, we can extract some useful information including the class, and the name of the genes included in the microarray and additional phenotype data available.
# Retrieve the class of loaded files:
ds.class("resource_expr_HELIX")
# Get the names of the first features (probes) available for the first cohort:
name.features <- ds.featureNames("resource_expr_HELIX")
lapply(name.features, head)[[1]]
# Get the names of the phenotype data available for the first cohort:
name.phenotypes <- ds.varLabels("resource_expr_HELIX")
lapply(name.phenotypes, head)[[1]]
# Also the annotation information can be inspected by:
name.annot <- ds.fvarLabels("resource_expr_HELIX")
lapply(name.annot, head)[[1]]
# Annotation information refers to additional data available at the feature
# level and that could be added to the summary statistics for a better
# identification of significant genes.
The differential expression analysis in dsOmicsClient is implemented in the funcion ds.limma(). Although, by default, the function considers that data is obtained from a microarray experiment (type.data = “microarray”), it could also be applied to RNAseq data.
In this case, since we are interested in evaluating the gene expression differences between boys and girls, we will run the next model (adjusted by possible confounders such as age and ethnicity):
model.sex <- ds.limma(model = ~ e3_sex + age_sample_years + h_ethnicity_cauc,
Set = "resource_expr_HELIX")
The top differentially expressed genes by cohort can then be visualized by:
model.sex
$BIB
# A tibble: 28,738 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TC13001096.hg.1 162 -0.126 0.164 -7.09 4.12e-11 0.00000119
2 TC22000217.hg.1 162 -0.156 0.0180 -6.74 2.72e-10 0.00000391
3 TC17000757.hg.1 162 0.0900 0.0233 6.22 4.23e- 9 0.0000387
4 TC20000999.hg.1 162 0.172 0.0242 6.10 7.48e- 9 0.0000387
5 TC20001182.hg.1 162 0.411 0.0267 6.10 7.58e- 9 0.0000387
6 TC20000178.hg.1 162 0.341 0.0321 6.09 8.09e- 9 0.0000387
7 TC19001398.hg.1 162 -0.207 0.0258 -5.79 3.68e- 8 0.000151
8 TC04001402.hg.1 162 -0.296 0.0459 -5.75 4.30e- 8 0.000154
9 TC05001973.hg.1 162 -0.390 0.0231 -5.52 1.34e- 7 0.000429
10 TC09000119.hg.1 162 0.0775 0.0309 5.49 1.57e- 7 0.000452
# ... with 28,728 more rows
$EDEN
# A tibble: 28,738 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TC12000357.hg.1 109 -0.0574 0.127 -4.71 0.00000729 0.121
2 TC15000639.hg.1 109 0.135 0.0200 4.64 0.00000985 0.121
3 TC02004993.hg.1 109 0.124 0.0302 4.52 0.0000158 0.121
4 TC02001326.hg.1 109 -0.0736 0.0418 -4.51 0.0000168 0.121
5 TC15001330.hg.1 109 0.0945 0.0488 4.34 0.0000317 0.135
6 TC19001398.hg.1 109 -0.191 0.0335 -4.27 0.0000416 0.135
7 TC20000999.hg.1 109 0.115 0.0279 4.26 0.0000441 0.135
8 TC12002776.hg.1 109 -0.155 0.0504 -4.25 0.0000459 0.135
9 TC05000721.hg.1 109 -0.171 0.0266 -4.23 0.0000491 0.135
10 TC01006291.hg.1 109 0.113 0.0315 4.23 0.0000498 0.135
# ... with 28,728 more rows
$KANC
# A tibble: 28,738 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TC11001104.hg.1 151 -0.346 0.148 -6.57 7.84e-10 0.0000225
2 TC10000002.hg.1 151 0.0839 0.0192 6.34 2.51e- 9 0.0000360
3 TC01001596.hg.1 151 -0.224 0.0245 -5.93 1.96e- 8 0.000158
4 TC04001402.hg.1 151 -0.226 0.0272 -5.91 2.20e- 8 0.000158
5 TC01000909.hg.1 151 -0.147 0.0510 -5.76 4.53e- 8 0.000248
6 TC15000872.hg.1 151 0.102 0.0409 5.72 5.44e- 8 0.000248
7 TC08000814.hg.1 151 0.173 0.0296 5.70 6.03e- 8 0.000248
8 TC13000150.hg.1 151 -0.0825 0.0424 -5.66 7.56e- 8 0.000272
9 TC20000178.hg.1 151 0.257 0.0298 5.55 1.27e- 7 0.000394
10 TC05001973.hg.1 151 -0.294 0.0292 -5.53 1.37e- 7 0.000394
# ... with 28,728 more rows
$MoBA
# A tibble: 28,738 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TC19001274.hg.1 245 -0.291 0.136 -7.19 8.03e-12 0.000000231
2 TC13000150.hg.1 245 -0.0768 0.0125 -6.99 2.68e-11 0.000000385
3 TC06000733.hg.1 245 0.110 0.0184 6.81 7.63e-11 0.000000563
4 TC15001330.hg.1 245 0.0911 0.0194 6.80 7.84e-11 0.000000563
5 TC01001720.hg.1 245 -0.262 0.0229 -6.75 1.07e-10 0.000000617
6 TC19001398.hg.1 245 -0.229 0.0277 -6.56 3.19e-10 0.00000142
7 TC01001596.hg.1 245 -0.204 0.0187 -6.55 3.46e-10 0.00000142
8 TC09000119.hg.1 245 0.0859 0.0368 6.47 5.21e-10 0.00000187
9 TC10002872.hg.1 245 0.0882 0.0192 6.39 8.39e-10 0.00000268
10 TC10000002.hg.1 245 0.0677 0.0243 6.27 1.62e- 9 0.00000466
# ... with 28,728 more rows
$Rhea
# A tibble: 28,738 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TC09001482.hg.1 156 0.108 0.150 7.17 2.77e-11 0.000000487
2 TC10000002.hg.1 156 0.0843 0.0201 7.14 3.39e-11 0.000000487
3 TC12003242.hg.1 156 0.132 0.0220 6.52 9.25e-10 0.00000886
4 TC15000872.hg.1 156 0.128 0.0209 6.12 7.39e- 9 0.0000531
5 TC17000757.hg.1 156 0.0990 0.0279 5.77 4.24e- 8 0.000244
6 TC11002451.hg.1 156 0.0894 0.0380 5.63 8.26e- 8 0.000367
7 TC20001182.hg.1 156 0.351 0.0225 5.61 8.94e- 8 0.000367
8 TC20000501.hg.1 156 0.0949 0.0540 5.36 2.93e- 7 0.00105
9 TC01000598.hg.1 156 0.0911 0.0234 5.29 4.03e- 7 0.00129
10 TC13000150.hg.1 156 -0.0741 0.0375 -5.22 5.54e- 7 0.00159
# ... with 28,728 more rows
$INMASAB
# A tibble: 28,738 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TC03003362.hg.1 184 -0.0779 0.127 -6.57 5.11e-10 0.0000147
2 TC09000450.hg.1 184 0.0782 0.0151 6.35 1.61e- 9 0.0000193
3 TC01006291.hg.1 184 0.114 0.0199 6.31 2.02e- 9 0.0000193
4 TC03000563.hg.1 184 -0.200 0.0225 -6.18 3.96e- 9 0.0000284
5 TC19001398.hg.1 184 -0.219 0.0319 -6.07 7.08e- 9 0.0000407
6 TC10002953.hg.1 184 0.0983 0.0333 5.62 6.96e- 8 0.000322
7 TC01000909.hg.1 184 -0.129 0.0247 -5.60 7.84e- 8 0.000322
8 TC17000757.hg.1 184 0.0793 0.0417 5.40 2.00e- 7 0.000709
9 TC10002872.hg.1 184 0.0917 0.0225 5.38 2.22e- 7 0.000709
10 TC01003403.hg.1 184 -0.182 0.0331 -5.32 2.97e- 7 0.000855
# ... with 28,728 more rows
attr(,"class")
[1] "dsLimma" "list"
We can verify whether the distribution of the observed p-values are the ones we would expect in this type of analyses:
ggplot(model.sex$BIB, aes(x=P.Value))+
geom_histogram(color="darkblue", fill="lightblue")+labs(
title="P-Value histogram plot",x="Raw p-value gender effect", y = "Count")
Another functionality of dsOmicsClient is that we can add annotation information at the probe-level to the output that is obtained with ds.limma(). Annotation information will be available as part of the expression set file “resource_expr_HELIX” and can be inspected by:
# Get annotation information for the first cohort
ds.fvarLabels("resource_expr_HELIX")[[1]]
[1] "transcript_cluster_id" "probeset_id" "seqname"
[4] "strand" "start" "stop"
[7] "total_probes" "gene_assignment" "mrna_assignment"
[10] "notes" "phase" "TC_size"
[13] "TSS_Affy" "EntrezeGeneID_Affy" "GeneSymbol_Affy"
[16] "GeneSymbolDB" "GeneSymbolDB2" "mrna_ID"
[19] "mrna_DB" "mrna_N" "notesYN"
[22] "geneYN" "genes_N" "CallRate"
[25] "fil1"
In case we were interested in adding gene symbol, entrezID and chromosome number mapping probes from each output, we can run:
model.sex_annot <- ds.limma(model = ~ e3_sex + age_sample_years + h_ethnicity_cauc,
Set = "resource_expr_HELIX",
annotCols = c("GeneSymbol_Affy","seqname",
"EntrezeGeneID_Affy"))
The results for each of the cohort (ordered by level of significance) can be accessed by:
model.sex_annot
$BIB
# A tibble: 28,738 x 10
id n beta SE t P.Value adj.P.Val GeneSymbol_Affy seqname
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 TC1300~ 162 -0.126 0.164 -7.09 4.12e-11 1.19e-6 INTS6-AS1 chr13
2 TC2200~ 162 -0.156 0.0180 -6.74 2.72e-10 3.91e-6 OSBP2 chr22
3 TC1700~ 162 0.0900 0.0233 6.22 4.23e- 9 3.87e-5 TLK2 chr17
4 TC2000~ 162 0.172 0.0242 6.10 7.48e- 9 3.87e-5 SYCP2 chr20
5 TC2000~ 162 0.411 0.0267 6.10 7.58e- 9 3.87e-5 FRG1BP chr20
6 TC2000~ 162 0.341 0.0321 6.09 8.09e- 9 3.87e-5 FRG1BP;FRG1CP chr20
7 TC1900~ 162 -0.207 0.0258 -5.79 3.68e- 8 1.51e-4 RPS4XP21 chr19
8 TC0400~ 162 -0.296 0.0459 -5.75 4.30e- 8 1.54e-4 TSPAN5 chr4
9 TC0500~ 162 -0.390 0.0231 -5.52 1.34e- 7 4.29e-4 FAXDC2 chr5
10 TC0900~ 162 0.0775 0.0309 5.49 1.57e- 7 4.52e-4 IFT74 chr9
# ... with 28,728 more rows, and 1 more variable: EntrezeGeneID_Affy <chr>
$EDEN
# A tibble: 28,738 x 10
id n beta SE t P.Value adj.P.Val GeneSymbol_Affy seqname
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 TC12000~ 109 -0.0574 0.127 -4.71 7.29e-6 0.121 CACNB3 chr12
2 TC15000~ 109 0.135 0.0200 4.64 9.85e-6 0.121 RPLP1 chr15
3 TC02004~ 109 0.124 0.0302 4.52 1.58e-5 0.121 C2orf15 chr2
4 TC02001~ 109 -0.0736 0.0418 -4.51 1.68e-5 0.121 DES chr2
5 TC15001~ 109 0.0945 0.0488 4.34 3.17e-5 0.135 MYEF2 chr15
6 TC19001~ 109 -0.191 0.0335 -4.27 4.16e-5 0.135 RPS4XP21 chr19
7 TC20000~ 109 0.115 0.0279 4.26 4.41e-5 0.135 SYCP2 chr20
8 TC12002~ 109 -0.155 0.0504 -4.25 4.59e-5 0.135 ART4 chr12
9 TC05000~ 109 -0.171 0.0266 -4.23 4.91e-5 0.135 IGIP chr5
10 TC01006~ 109 0.113 0.0315 4.23 4.98e-5 0.135 CFH chr1
# ... with 28,728 more rows, and 1 more variable: EntrezeGeneID_Affy <chr>
$KANC
# A tibble: 28,738 x 10
id n beta SE t P.Value adj.P.Val GeneSymbol_Affy seqname
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 TC1100~ 151 -0.346 0.148 -6.57 7.84e-10 0.0000225 TBCEL chr11
2 TC1000~ 151 0.0839 0.0192 6.34 2.51e- 9 0.0000360 ZMYND11 chr10
3 TC0100~ 151 -0.224 0.0245 -5.93 1.96e- 8 0.000158 SWT1 chr1
4 TC0400~ 151 -0.226 0.0272 -5.91 2.20e- 8 0.000158 TSPAN5 chr4
5 TC0100~ 151 -0.147 0.0510 -5.76 4.53e- 8 0.000248 S1PR1 chr1
6 TC1500~ 151 0.102 0.0409 5.72 5.44e- 8 0.000248 MAN2A2 chr15
7 TC0800~ 151 0.173 0.0296 5.70 6.03e- 8 0.000248 LY6E chr8
8 TC1300~ 151 -0.0825 0.0424 -5.66 7.56e- 8 0.000272 DGKH chr13
9 TC2000~ 151 0.257 0.0298 5.55 1.27e- 7 0.000394 FRG1BP;FRG1CP chr20
10 TC0500~ 151 -0.294 0.0292 -5.53 1.37e- 7 0.000394 FAXDC2 chr5
# ... with 28,728 more rows, and 1 more variable: EntrezeGeneID_Affy <chr>
$MoBA
# A tibble: 28,738 x 10
id n beta SE t P.Value adj.P.Val GeneSymbol_Affy seqname
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 TC1900~ 245 -0.291 0.136 -7.19 8.03e-12 2.31e-7 PLVAP chr19
2 TC1300~ 245 -0.0768 0.0125 -6.99 2.68e-11 3.85e-7 DGKH chr13
3 TC0600~ 245 0.110 0.0184 6.81 7.63e-11 5.63e-7 DDX43 chr6
4 TC1500~ 245 0.0911 0.0194 6.80 7.84e-11 5.63e-7 MYEF2 chr15
5 TC0100~ 245 -0.262 0.0229 -6.75 1.07e-10 6.17e-7 CTSE chr1
6 TC1900~ 245 -0.229 0.0277 -6.56 3.19e-10 1.42e-6 RPS4XP21 chr19
7 TC0100~ 245 -0.204 0.0187 -6.55 3.46e-10 1.42e-6 SWT1 chr1
8 TC0900~ 245 0.0859 0.0368 6.47 5.21e-10 1.87e-6 IFT74 chr9
9 TC1000~ 245 0.0882 0.0192 6.39 8.39e-10 2.68e-6 FAM24B;CUZD1 chr10
10 TC1000~ 245 0.0677 0.0243 6.27 1.62e- 9 4.66e-6 ZMYND11 chr10
# ... with 28,728 more rows, and 1 more variable: EntrezeGeneID_Affy <chr>
$Rhea
# A tibble: 28,738 x 10
id n beta SE t P.Value adj.P.Val GeneSymbol_Affy seqname
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 TC0900~ 156 0.108 0.150 7.17 2.77e-11 4.87e-7 KIAA0368 chr9
2 TC1000~ 156 0.0843 0.0201 7.14 3.39e-11 4.87e-7 ZMYND11 chr10
3 TC1200~ 156 0.132 0.0220 6.52 9.25e-10 8.86e-6 PCBP2 chr12
4 TC1500~ 156 0.128 0.0209 6.12 7.39e- 9 5.31e-5 MAN2A2 chr15
5 TC1700~ 156 0.0990 0.0279 5.77 4.24e- 8 2.44e-4 TLK2 chr17
6 TC1100~ 156 0.0894 0.0380 5.63 8.26e- 8 3.67e-4 ZBTB44 chr11
7 TC2000~ 156 0.351 0.0225 5.61 8.94e- 8 3.67e-4 FRG1BP chr20
8 TC2000~ 156 0.0949 0.0540 5.36 2.93e- 7 1.05e-3 GID8 chr20
9 TC0100~ 156 0.0911 0.0234 5.29 4.03e- 7 1.29e-3 CMPK1 chr1
10 TC1300~ 156 -0.0741 0.0375 -5.22 5.54e- 7 1.59e-3 DGKH chr13
# ... with 28,728 more rows, and 1 more variable: EntrezeGeneID_Affy <chr>
$INMASAB
# A tibble: 28,738 x 10
id n beta SE t P.Value adj.P.Val GeneSymbol_Affy seqname
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 TC0300~ 184 -0.0779 0.127 -6.57 5.11e-10 0.0000147 PHLDB2 chr3
2 TC0900~ 184 0.0782 0.0151 6.35 1.61e- 9 0.0000193 PHF2 chr9
3 TC0100~ 184 0.114 0.0199 6.31 2.02e- 9 0.0000193 CFH chr1
4 TC0300~ 184 -0.200 0.0225 -6.18 3.96e- 9 0.0000284 CD96 chr3
5 TC1900~ 184 -0.219 0.0319 -6.07 7.08e- 9 0.0000407 RPS4XP21 chr19
6 TC1000~ 184 0.0983 0.0333 5.62 6.96e- 8 0.000322 CUZD1 chr10
7 TC0100~ 184 -0.129 0.0247 -5.60 7.84e- 8 0.000322 S1PR1 chr1
8 TC1700~ 184 0.0793 0.0417 5.40 2.00e- 7 0.000709 TLK2 chr17
9 TC1000~ 184 0.0917 0.0225 5.38 2.22e- 7 0.000709 FAM24B;CUZD1 chr10
10 TC0100~ 184 -0.182 0.0331 -5.32 2.97e- 7 0.000855 SLAMF6 chr1
# ... with 28,728 more rows, and 1 more variable: EntrezeGeneID_Affy <chr>
attr(,"class")
[1] "dsLimma" "list"
We can also check whether there is inflation just executing:
qqplot(model.sex$BIB$P.Value)
So, in that case, the model does not need to remove unwanted variability (λ<2). If that was not the case, we could use surrogate variable analysis just changing the argument sva=TRUE..
The ds.limma function has another arguments that can be used to fit other type of models:
Then, data can be combined meta-analysed as follows:
metaP.model.sex.annot <- metaPvalues(model.sex_annot)
metaP.model.sex.annot
# A tibble: 28,738 x 8
id BIB EDEN KANC MoBA Rhea INMASAB p.meta
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 TC1000000~ 3.10e- 6 8.20e-4 2.51e-9 1.62e- 9 3.39e-11 1.09e-5 2.63e-34
2 TC1900139~ 3.68e- 8 4.16e-5 1.92e-7 3.19e-10 6.26e- 5 7.08e-9 2.54e-33
3 TC2000118~ 7.58e- 9 7.07e-5 5.48e-7 8.83e- 9 8.94e- 8 8.94e-7 1.16e-32
4 TC1700075~ 4.23e- 9 7.24e-4 9.59e-6 3.60e- 9 4.24e- 8 2.00e-7 4.64e-32
5 TC1500087~ 4.20e- 6 6.67e-5 5.44e-8 3.83e- 8 7.39e- 9 8.93e-7 1.84e-31
6 TC2000017~ 8.09e- 9 9.86e-5 1.27e-7 9.23e- 8 1.26e- 6 1.46e-6 7.53e-31
7 TC1300109~ 4.12e-11 1.01e-4 3.58e-5 3.58e- 7 1.02e- 6 1.88e-5 3.56e-29
8 TC0100090~ 5.82e- 5 8.74e-4 4.53e-8 3.43e- 9 2.77e- 6 7.84e-8 5.75e-29
9 TC0900148~ 7.50e- 6 1.11e-3 1.74e-5 2.15e- 6 2.77e-11 8.15e-7 2.17e-28
10 TC1300015~ 5.68e- 6 8.56e-3 7.56e-8 2.68e-11 5.54e- 7 1.97e-4 3.22e-28
# ... with 28,728 more rows
# Get the number of differentially expressed genes in the meta approach
dim(metaP.model.sex.annot[metaP.model.sex.annot$p.meta<0.05,])
[1] 3839 8
Once we have obtained the top differentially expressed genes per cohort, we could extract their gene symbols or gene entrez ids directly from each output dataframe and continue with the functional annotation analysis in our local session as usual. To this end, one of the classic approaches is to evaluate whether any information (in the form of annotations) is over-represented in the gene list compared to the rest of the genes in the genome. This type of enrichment analysis is based on annotating genes with information available in different databases such as Gene Ontology or the Kyoto Encyclopedia of Genes and Genomes, and establishing the frequencies of each term in the gene list and the rest of the genome. This allows a statistical test to be applied to determine which functional annotations are significantly enriched in the list.
To extract the list of top differentially expressed genes by cohort and perform enrichment analysis with KEGG we can run the next code lines:
# Get the entrez id for the top differentially expressed genes in each cohort.
get_entrezid <- function(x){ x[(x$adj.P.Val < 0.05),c("id","EntrezeGeneID_Affy")] }
topgenes_by_cohort <- lapply(model.sex_annot,get_entrezid)
length(unlist(lapply(topgenes_by_cohort,function(x) x[,2])))
[1] 666
# Get the entrez id for the top differentially expressed genes in the meta-analysis
# approach
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
filters <- listFilters(ensembl)
attributes <- listAttributes(ensembl)
affy_id <- gsub(x = as.character(metaP.model.sex.annot[metaP.model.sex.annot$p.meta<
0.05/nrow(metaP.model.sex.annot),1]$id), pattern = "hg.1", "hg")
# Number of probes passing multiple-testing correction in the meta-analysis.
length(affy_id)
[1] 325
# Get the entrez id of the genes mapped by the probes from the previous list.
topgenes_metaP <- getBM(attributes = c("entrezgene_id"),
filters = "affy_hta_2_0",
values = affy_id,
mart = ensembl)
length(topgenes_metaP[,1])
[1] 276
# Functional enrichment analysis with KEGG.
# We will use as reference a list of genes loaded from the DOSE package.
data(geneList, package="DOSE")
# Enrichment for top genes from each cohort
enrichKEGG <- enrichKEGG(gene = unlist(lapply(topgenes_by_cohort,function(x) x[,2])),
organism = 'hsa',
pvalueCutoff = 0.1)
head(enrichKEGG)
ID Description GeneRatio
hsa03010 hsa03010 Ribosome 20/183
hsa03008 hsa03008 Ribosome biogenesis in eukaryotes 16/183
hsa04211 hsa04211 Longevity regulating pathway 10/183
hsa04213 hsa04213 Longevity regulating pathway - multiple species 7/183
hsa04150 hsa04150 mTOR signaling pathway 11/183
hsa04152 hsa04152 AMPK signaling pathway 9/183
BgRatio pvalue p.adjust qvalue
hsa03010 158/8135 2.903164e-10 7.490164e-08 6.295282e-08
hsa03008 109/8135 2.144118e-09 2.765912e-07 2.324675e-07
hsa04211 89/8135 2.831858e-05 2.435398e-03 2.046887e-03
hsa04213 62/8135 4.497217e-04 2.900705e-02 2.437965e-02
hsa04150 156/8135 7.577933e-04 3.910213e-02 3.286430e-02
hsa04152 120/8135 1.483310e-03 5.812931e-02 4.885613e-02
geneID
hsa03010 6176/6171/6208/100169751/100169753/100169754/100169755/100169756/100169757/100169758/100169759/100169761/100169762/100169763/100169764/100169765/100169766/100169767/100169768/23521
hsa03008 100169751/100169753/100169754/100169755/100169756/100169757/100169758/100169759/100169761/100169762/100169763/100169764/100169765/100169766/100169767/100169768
hsa04211 5567/5562/6198/27244/9586/64764/6648/3480/10000/207
hsa04213 5567/5562/6198/6648/3480/10000/207
hsa04150 523/6446/5562/6198/7477/1975/3480/5578/10000/207/9706
hsa04152 5562/6198/55844/1938/9586/64764/3480/10000/207
Count
hsa03010 20
hsa03008 16
hsa04211 10
hsa04213 7
hsa04150 11
hsa04152 9
# Enrichment for top genes from the meta-analysis
enrichKEGG_meta <- enrichKEGG(gene = as.character(topgenes_metaP[,1]),
organism = 'hsa',
pvalueCutoff = 0.1)
head(enrichKEGG_meta)
ID Description GeneRatio BgRatio pvalue
hsa04211 hsa04211 Longevity regulating pathway 7/117 89/8135 0.0002767746
p.adjust qvalue geneID Count
hsa04211 0.06393493 0.0614731 9586/6198/3480/26060/27244/5562/5567 7
# The pathways can be visualized alongside the genes associated with them:
browseKEGG(enrichKEGG, 'hsa04211')
As a result, we identify a list of 325 probes (mapping 287 genes) differentially expressed between boys and girls of the HELIX cohort, passing multiple-testing correction filter (P-value threshold=1.74e-06). As a proof of the ability of OmicSHIELD to be integrated with other R functionalities and bioconductor utilties, we continued the showcase presenting a functional enrichment analysis (FEA) of results, showing that significant DE genes participate in processes with evident and previously-described sexual dimorphism such is the case of “Longevity regulating pathways” (hsa04211).
Once analyses are completed, we proceed to close the DataSHIELD session by:
datashield.logout(conns)
In this section, we will illustrate how to perform DNA methylation differential analysis using real data from the HELIX project. Particularly, we will analyse microarray data deriving from the Infinium HumanMethylation450k platform of Illumina. On these data, we will show how to perform a epigenome-wide association analysis (EWAS) to compare the DNA methylation profiles differing between boys and girls in the HELIX cohorts. We will illustrate the following use cases:
In comparison to gene expression data, where data are not always normalized in the same way (especially in the case of RNAseq data), when analyzing EWAS data one count on normalized Beta or M-Values. This favors the harmonization of the dataset between cohorts, thereby making possible to perform a pooled analysis instead of a meta-analysis approach.
In this section, we will describe how to configure R and DataSHIELD with the needed packages to carry out proposed analyses in remote. We start by installing the client-side version of the following DataSHIELD/Opal integration packages.
install.packages("DSOpal", dependencies=TRUE)
install.packages("DSI", dependencies=TRUE)
Make sure you also install the DataSHIELD client-side version of the package dsBaseClient.
install.packages("dsBaseClient",
repos = c("http://cran.datashield.org","https://cloud.r-project.org/"),
dependencies = TRUE)
Then, install the client-side version of the dsOmicsClient package directly from GitHub.
install.packages("devtools")
require("devtools")
devtools::install_github("isglobal-brge/dsOmicsClient",ref="master",
auth_token = "INSERT_YOUR_GITHUB_TOKEN_HERE")
Once installations are completed, all the packages are loaded as usual.
require(DSOpal)
require(DSI)
require(dsBaseClient)
require(dsOmicsClient)
# Loading additional required packages (if not installed, you can easly install them using the BiocManager::install() function)
require(clusterProfiler)
require(org.Hs.eg.db)
In this section, we will cover how to load and inspect input microarray data with DataSHIELD. We start by creating the connection to the opal server using an user who have DataSHIELD permissions.
builder <- DSI::newDSLoginBuilder()
builder$append(server = "BIB", url = "https://datashield.isglobal.org/repo",
user = "invited", password = "12345678"
, profile = "rock-inma")
builder$append(server = "EDEN", url = "https://datashield.isglobal.org/repo",
user = "invited", password = "12345678",
profile = "rock-inma")
builder$append(server = "KANC", url = "https://datashield.isglobal.org/repo",
user = "invited", password = "12345678",
profile = "rock-inma")
builder$append(server = "MoBA", url = "https://datashield.isglobal.org/repo",
user = "invited", password = "12345678",
profile = "rock-inma")
builder$append(server = "Rhea", url = "https://datashield.isglobal.org/repo",
user = "invited", password = "12345678",
profile = "rock-inma")
builder$append(server = "INMASAB", url ="https://datashield.isglobal.org/repo",
user = "invited", password = "12345678",
profile = "rock-inma")
logindata <- builder$build()
conns <- DSI::datashield.login(logins = logindata)
Now that we have created a connection object to the Opal, we have started a new R session on the server, and our analysis will take place in this remote session, so we have to load the data into it. In this use case, available data are in the form GenomicRatioSet, a extend of the class SummarizedExperiment especially designed for DNA methylation data obtained with the Infinium HumanMethylation450k platform and usually analysed with the associated R package minfi. This type of data usually contain pre-processed DNA methylation values at a genome-wide level, in the form of M or/and Beta values, together with the associated genomic coordinates. As in the case of ExpressionSet types, GenomicRatioSet can also incorporate phenotype and metadata information.
For our showcase, one GenomicRatioSet file per cohort is hosted in the Opal BRGE site in the form of a resource, all organized within the Opal server in a project called HELIX_omics.
For loading these data into the remote DataSHIELD session we will use the DSI::datashield.assign.resource() function. This function takes the connections to the server created in the previous code chunk to assign all available resource objects from a project in the Opal to an R object in the DataSHIELD remote session. As it can be seen in the code, resources in DataSHIELD are called in the function following the next structure “NameOfOpalProject.NameOfResource”.
# We assign post-natal data from all cohorts to an object called resource_pos
DSI::datashield.assign.resource(conns[1], "assinged_resource_DNAm",
"HELIX_omics.methy_BIB")
DSI::datashield.assign.resource(conns[2], "assinged_resource_DNAm",
"HELIX_omics.methy_EDE")
DSI::datashield.assign.resource(conns[3], "assinged_resource_DNAm",
"HELIX_omics.methy_KAN")
DSI::datashield.assign.resource(conns[4], "assinged_resource_DNAm",
"HELIX_omics.methy_MOB")
DSI::datashield.assign.resource(conns[5], "assinged_resource_DNAm",
"HELIX_omics.methy_RHE")
DSI::datashield.assign.resource(conns[6], "assinged_resource_DNAm",
"HELIX_omics.methy_SAB")
Then, we have to resolve the resources and retrieve the data in the remote session (server-side). For that, we will use the DSI::datashield.assign.expr() function. As a result, we will get an R object (here named “resource_DNAm”) containing the available GenomicRatioSet files for cohorts. For created objects, we can extract some useful information including the dimension, class, and the name of the CpGs analysed in the microarray and additional phenotype data available.
# We resolve the resource
DSI::datashield.assign.expr(conns = conns, symbol = "resource_DNAm",
expr = as.symbol("as.resource.object(assinged_resource_DNAm)"))
ds.dim("resource_DNAm")
ds.class("resource_DNAm")
#The names of the CpGs included in the array can be extracted using
# the same function used for extracting probe names in the case of analyzing
# an ExpressionSet:
name.features <- ds.featureNames("resource_DNAm")
lapply(name.features, head)[[1]]
#Experimental phenotypes variables can be obtained by:
name.phenotypes <- ds.varLabels("resource_DNAm")
lapply(name.phenotypes, head)[[1]]
#Also the annotation information can be obtained by:
name.annot <- ds.fvarLabels("resource_DNAm")
lapply(name.annot, head)[[1]]
If we want to perform a quick full EWAS analysis in DataSHIELD we must run a meta-analysis approach (analyzing each cohort separately). For that, we can use the same function than in the showcase of transcriptomic analysis (ds.limma()).
meta.model.sex <- ds.limma(model = ~ e3_sex,
Set = "resource_DNAm",
datasources = conns)
Then, we can visualize the top significant CpGs at each study (i.e server) by:
lapply(meta.model.sex, head)
$BIB
# A tibble: 6 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg15035382 162 0.0482 0.0117 4.72 0.00000502 0.936
2 cg08308934 162 0.0239 0.00324 4.67 0.00000626 0.936
3 cg06166863 162 -0.0239 0.00860 -4.48 0.0000138 1.00
4 cg19617080 162 0.0158 0.00466 4.20 0.0000437 1.00
5 cg26328510 162 -0.0181 0.0113 -4.17 0.0000500 1.00
6 cg20041381 162 0.0184 0.00445 4.12 0.0000593 1.00
$EDEN
# A tibble: 6 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg22339219 82 -0.0239 0.0157 -4.58 0.0000156 0.999
2 cg26256793 82 0.0217 0.00932 4.40 0.0000308 0.999
3 cg22672067 82 -0.0256 0.0103 -4.39 0.0000321 0.999
4 cg04920428 82 -0.0303 0.00518 -4.38 0.0000336 0.999
5 cg24052148 82 -0.0249 0.0139 -4.34 0.0000390 0.999
6 cg11822659 82 -0.0299 0.00676 -4.34 0.0000394 0.999
$KANC
# A tibble: 6 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg12052203 146 0.107 0.0131 4.93 0.00000218 0.473
2 cg22266749 146 0.0650 0.00323 4.84 0.00000316 0.473
3 cg21411366 146 0.0368 0.00901 4.22 0.0000429 1.00
4 cg16216407 146 0.0329 0.00425 4.12 0.0000638 1.00
5 cg08093323 146 0.0220 0.0124 4.08 0.0000730 1.00
6 cg23108580 146 -0.0321 0.00414 -4.04 0.0000853 1.00
$MoBA
# A tibble: 6 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg12205554 196 -0.0138 0.0102 -4.26 0.0000320 1.00
2 cg06998765 196 -0.0238 0.00267 -4.19 0.0000418 1.00
3 cg25240363 196 0.0112 0.00806 4.17 0.0000462 1.00
4 cg11643285 196 -0.0282 0.00343 -4.15 0.0000500 1.00
5 cg27191131 196 -0.0126 0.00998 -4.06 0.0000714 1.00
6 cg27150870 196 0.0220 0.00384 4.05 0.0000725 1.00
$Rhea
# A tibble: 6 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg15777472 156 0.0328 0.0136 4.73 0.00000485 1.00
2 cg19249678 156 0.0126 0.00334 4.55 0.0000106 1.00
3 cg21377936 156 0.0223 0.00939 4.54 0.0000111 1.00
4 cg12052203 156 0.0800 0.00385 4.41 0.0000193 1.00
5 cg12613618 156 0.0373 0.0110 4.32 0.0000272 1.00
6 cg06544141 156 0.0142 0.00454 4.31 0.0000280 1.00
$INMASAB
# A tibble: 6 x 7
id n beta SE t P.Value adj.P.Val
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg05293407 176 0.0189 0.0114 4.70 0.00000511 1.00
2 cg02802904 176 0.0125 0.00500 4.35 0.0000232 1.00
3 cg22810281 176 -0.0178 0.00893 -4.30 0.0000281 1.00
4 cg18126027 176 -0.0134 0.00421 -4.17 0.0000479 1.00
5 cg03368690 176 -0.0146 0.0108 -4.12 0.0000576 1.00
6 cg03691818 176 -0.0331 0.00368 -4.11 0.0000597 1.00
As in the case of gene expression analysis, annotation columns can be added to the output by using the argument annotCols. It should be a vector with the columns of the annotation available in the GenomeRatioSet that want to be showed. To obtain the available annotation columns use the function ds.fvarLabels().
meta.model.sex.annot <- ds.limma(model = ~ e3_sex,
Set = "resource_DNAm",
annotCols = c("chromosome","start","end","UCSC_RefGene_Name"),
datasources = conns)
lapply(meta.model.sex.annot, head)
$BIB
# A tibble: 6 x 11
id n beta SE t P.Value adj.P.Val chromosome start end
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int> <int>
1 cg150~ 162 0.0482 0.0117 4.72 5.02e-6 0.936 chr19 4.07e7 4.07e7
2 cg083~ 162 0.0239 0.00324 4.67 6.26e-6 0.936 chr14 9.20e7 9.20e7
3 cg061~ 162 -0.0239 0.00860 -4.48 1.38e-5 1.00 chr14 3.96e7 3.96e7
4 cg196~ 162 0.0158 0.00466 4.20 4.37e-5 1.00 chr3 1.83e8 1.83e8
5 cg263~ 162 -0.0181 0.0113 -4.17 5.00e-5 1.00 chr10 1.11e7 1.11e7
6 cg200~ 162 0.0184 0.00445 4.12 5.93e-5 1.00 chr1 1.86e8 1.86e8
# ... with 1 more variable: UCSC_RefGene_Name <chr>
$EDEN
# A tibble: 6 x 11
id n beta SE t P.Value adj.P.Val chromosome start end
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int> <int>
1 cg223~ 82 -0.0239 0.0157 -4.58 1.56e-5 0.999 chr7 1.05e8 1.05e8
2 cg262~ 82 0.0217 0.00932 4.40 3.08e-5 0.999 chr1 1.04e8 1.04e8
3 cg226~ 82 -0.0256 0.0103 -4.39 3.21e-5 0.999 chr15 7.45e7 7.45e7
4 cg049~ 82 -0.0303 0.00518 -4.38 3.36e-5 0.999 chr17 5.77e7 5.77e7
5 cg240~ 82 -0.0249 0.0139 -4.34 3.90e-5 0.999 chr19 4.07e6 4.07e6
6 cg118~ 82 -0.0299 0.00676 -4.34 3.94e-5 0.999 chr20 5.31e7 5.31e7
# ... with 1 more variable: UCSC_RefGene_Name <chr>
$KANC
# A tibble: 6 x 11
id n beta SE t P.Value adj.P.Val chromosome start end
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int> <int>
1 cg120~ 146 0.107 0.0131 4.93 2.18e-6 0.473 chr11 6.61e7 6.61e7
2 cg222~ 146 0.0650 0.00323 4.84 3.16e-6 0.473 chr4 1.10e8 1.10e8
3 cg214~ 146 0.0368 0.00901 4.22 4.29e-5 1.00 chr4 2.48e7 2.48e7
4 cg162~ 146 0.0329 0.00425 4.12 6.38e-5 1.00 chr4 2.48e7 2.48e7
5 cg080~ 146 0.0220 0.0124 4.08 7.30e-5 1.00 chr11 3.24e6 3.24e6
6 cg231~ 146 -0.0321 0.00414 -4.04 8.53e-5 1.00 chr12 3.98e7 3.98e7
# ... with 1 more variable: UCSC_RefGene_Name <chr>
$MoBA
# A tibble: 6 x 11
id n beta SE t P.Value adj.P.Val chromosome start end
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int> <int>
1 cg122~ 196 -0.0138 0.0102 -4.26 3.20e-5 1.00 chr10 1.57e6 1.57e6
2 cg069~ 196 -0.0238 0.00267 -4.19 4.18e-5 1.00 chr14 7.54e7 7.54e7
3 cg252~ 196 0.0112 0.00806 4.17 4.62e-5 1.00 chr17 2.80e7 2.80e7
4 cg116~ 196 -0.0282 0.00343 -4.15 5.00e-5 1.00 chr3 1.64e7 1.64e7
5 cg271~ 196 -0.0126 0.00998 -4.06 7.14e-5 1.00 chr3 1.34e8 1.34e8
6 cg271~ 196 0.0220 0.00384 4.05 7.25e-5 1.00 chr16 8.94e7 8.94e7
# ... with 1 more variable: UCSC_RefGene_Name <chr>
$Rhea
# A tibble: 6 x 11
id n beta SE t P.Value adj.P.Val chromosome start end
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int> <int>
1 cg1577~ 156 0.0328 0.0136 4.73 4.85e-6 1.00 chr11 6.57e7 6.57e7
2 cg1924~ 156 0.0126 0.00334 4.55 1.06e-5 1.00 chr7 1.57e8 1.57e8
3 cg2137~ 156 0.0223 0.00939 4.54 1.11e-5 1.00 chr7 4.77e6 4.77e6
4 cg1205~ 156 0.0800 0.00385 4.41 1.93e-5 1.00 chr11 6.61e7 6.61e7
5 cg1261~ 156 0.0373 0.0110 4.32 2.72e-5 1.00 chr11 6.37e7 6.37e7
6 cg0654~ 156 0.0142 0.00454 4.31 2.80e-5 1.00 chr1 2.73e7 2.73e7
# ... with 1 more variable: UCSC_RefGene_Name <chr>
$INMASAB
# A tibble: 6 x 11
id n beta SE t P.Value adj.P.Val chromosome start end
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int> <int>
1 cg052~ 176 0.0189 0.0114 4.70 5.11e-6 1.00 chr6 2.89e7 2.89e7
2 cg028~ 176 0.0125 0.00500 4.35 2.32e-5 1.00 chr10 1.31e8 1.31e8
3 cg228~ 176 -0.0178 0.00893 -4.30 2.81e-5 1.00 chr6 3.19e7 3.19e7
4 cg181~ 176 -0.0134 0.00421 -4.17 4.79e-5 1.00 chr6 1.61e8 1.61e8
5 cg033~ 176 -0.0146 0.0108 -4.12 5.76e-5 1.00 chr16 2.73e7 2.73e7
6 cg036~ 176 -0.0331 0.00368 -4.11 5.97e-5 1.00 chr12 5.31e7 5.31e7
# ... with 1 more variable: UCSC_RefGene_Name <chr>
Up to this point, we have obtained association results for each cohort on the study. The next step is to combine this information using meta-analysis methods to derive a pooled estimate closest to the common truth. Each researcher might have an already built pipeline to do so, or a preferred method; nevertheless, we included a couple methods inside dsOmicsClient. They are the following:
metaP.model.sex.annot <- metaPvalues(meta.model.sex.annot)
metaP.model.sex.annot
# A tibble: 299,221 x 8
id BIB EDEN KANC MoBA Rhea INMASAB p.meta
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg12052203 0.423 0.0522 0.00000218 0.0257 0.0000193 0.0129 1.72e-10
2 cg22266749 0.571 0.0193 0.00000316 0.0668 0.000117 0.0399 3.67e- 9
3 cg11643285 0.00267 0.909 0.0322 0.0000500 0.000136 0.0345 5.71e- 9
4 cg03691818 0.0115 0.674 0.0291 0.00979 0.0207 0.0000597 3.71e- 7
5 cg07210061 0.351 0.00238 0.358 0.00365 0.0331 0.00414 9.33e- 6
6 cg25698541 0.00102 0.129 0.807 0.00271 0.107 0.0133 2.06e- 5
7 cg15422033 0.317 0.0496 0.00381 0.887 0.000845 0.0108 2.35e- 5
8 cg23108580 0.578 0.00227 0.0000853 0.0495 0.115 0.852 2.57e- 5
9 cg12613618 0.158 0.0988 0.00425 0.472 0.0000272 0.689 2.73e- 5
10 cg24537724 0.165 0.754 0.000380 0.0265 0.0297 0.0171 2.90e- 5
# ... with 299,211 more rows
We can create a QQ-plot by using the function qqplot() available in our package.
qqplot(metaP.model.sex.annot$p.meta)
Here, as can be observed, there is no need to remove unwanted variability (λ<2). Nevertheless, we will illustrate how to proceed in case of observing inflation.
The vast majority of omic studies require to control for unwanted variability. The surrogate variable analysis can address this issue by estimating some hidden covariates that capture differences across individuals due to some artifacts such as batch effects or sample quality among others. The method is implemented in SVA package.
Performing this type of analysis using the ds.lmFeature function is not allowed since estimating SVA would require to implement a non-disclosive method that computes SVA from the different servers. This will be a future topic of the dsOmicsClient. For that reason we have to adopt a compromise solution which is to perform the SVA independently at each study. We use the ds.limma function to perform the analyses adjusted for SVA at each study. Especially for the case of EWAS data this kind of analysis is important since it is usual to find important sources of unwanted variability affecting the global levels of methylation (e.g., the existence of different white-cell proportions when the sample type under analysis is whole-blood)
meta.model.sex.annot.sva <- ds.limma(model = ~ e3_sex,
Set = "resource_DNAm",
sva = TRUE,
annotCols = c("chromosome","start","end","UCSC_RefGene_Name"))
meta.model.sex.annot.sva
$BIB
# A tibble: 299,221 x 11
id n beta SE t P.Value adj.P.Val chromosome start end
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int> <int>
1 cg150~ 162 0.0482 0.0117 4.72 5.02e-6 0.936 chr19 4.07e7 4.07e7
2 cg083~ 162 0.0239 0.00324 4.67 6.26e-6 0.936 chr14 9.20e7 9.20e7
3 cg061~ 162 -0.0239 0.00860 -4.48 1.38e-5 1.00 chr14 3.96e7 3.96e7
4 cg196~ 162 0.0158 0.00466 4.20 4.37e-5 1.00 chr3 1.83e8 1.83e8
5 cg263~ 162 -0.0181 0.0113 -4.17 5.00e-5 1.00 chr10 1.11e7 1.11e7
6 cg200~ 162 0.0184 0.00445 4.12 5.93e-5 1.00 chr1 1.86e8 1.86e8
7 cg276~ 162 -0.0143 0.00425 -4.06 7.64e-5 1.00 chr20 2.45e6 2.45e6
8 cg049~ 162 -0.0134 0.00354 -4.05 7.83e-5 1.00 chr1 2.34e8 2.34e8
9 cg090~ 162 -0.0141 0.00471 -4.01 9.15e-5 1.00 chr16 8.10e7 8.10e7
10 cg257~ 162 -0.0414 0.00373 -4.01 9.21e-5 1.00 chr18 2.86e7 2.86e7
# ... with 299,211 more rows, and 1 more variable: UCSC_RefGene_Name <chr>
$EDEN
# A tibble: 299,221 x 11
id n beta SE t P.Value adj.P.Val chromosome start end
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int> <int>
1 cg223~ 82 -0.0239 0.0157 -4.58 1.56e-5 0.999 chr7 1.05e8 1.05e8
2 cg262~ 82 0.0217 0.00932 4.40 3.08e-5 0.999 chr1 1.04e8 1.04e8
3 cg226~ 82 -0.0256 0.0103 -4.39 3.21e-5 0.999 chr15 7.45e7 7.45e7
4 cg049~ 82 -0.0303 0.00518 -4.38 3.36e-5 0.999 chr17 5.77e7 5.77e7
5 cg240~ 82 -0.0249 0.0139 -4.34 3.90e-5 0.999 chr19 4.07e6 4.07e6
6 cg118~ 82 -0.0299 0.00676 -4.34 3.94e-5 0.999 chr20 5.31e7 5.31e7
7 cg231~ 82 0.0283 0.00631 4.30 4.47e-5 0.999 chr5 1.77e8 1.77e8
8 cg042~ 82 0.0246 0.00559 4.20 6.58e-5 0.999 chr12 1.09e7 1.09e7
9 cg005~ 82 -0.0274 0.00624 -4.20 6.60e-5 0.999 chr19 5.00e7 5.00e7
10 cg223~ 82 0.0306 0.00985 4.19 6.69e-5 0.999 chr19 6.21e6 6.21e6
# ... with 299,211 more rows, and 1 more variable: UCSC_RefGene_Name <chr>
$KANC
# A tibble: 299,221 x 11
id n beta SE t P.Value adj.P.Val chromosome start end
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int> <int>
1 cg120~ 146 0.107 0.0131 4.93 2.18e-6 0.473 chr11 6.61e7 6.61e7
2 cg222~ 146 0.0650 0.00323 4.84 3.16e-6 0.473 chr4 1.10e8 1.10e8
3 cg214~ 146 0.0368 0.00901 4.22 4.29e-5 1.00 chr4 2.48e7 2.48e7
4 cg162~ 146 0.0329 0.00425 4.12 6.38e-5 1.00 chr4 2.48e7 2.48e7
5 cg080~ 146 0.0220 0.0124 4.08 7.30e-5 1.00 chr11 3.24e6 3.24e6
6 cg231~ 146 -0.0321 0.00414 -4.04 8.53e-5 1.00 chr12 3.98e7 3.98e7
7 cg099~ 146 0.0169 0.00407 4.00 1.01e-4 1.00 chr2 2.19e8 2.19e8
8 cg135~ 146 -0.0163 0.00439 -3.98 1.07e-4 1.00 chr3 5.23e7 5.23e7
9 cg275~ 146 -0.0132 0.00451 -3.97 1.10e-4 1.00 chr17 1.89e7 1.89e7
10 cg008~ 146 0.0195 0.00400 3.97 1.12e-4 1.00 chr3 5.22e7 5.22e7
# ... with 299,211 more rows, and 1 more variable: UCSC_RefGene_Name <chr>
$MoBA
# A tibble: 299,221 x 11
id n beta SE t P.Value adj.P.Val chromosome start end
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int> <int>
1 cg122~ 196 -0.0138 0.0102 -4.26 3.20e-5 1.00 chr10 1.57e6 1.57e6
2 cg069~ 196 -0.0238 0.00267 -4.19 4.18e-5 1.00 chr14 7.54e7 7.54e7
3 cg252~ 196 0.0112 0.00806 4.17 4.62e-5 1.00 chr17 2.80e7 2.80e7
4 cg116~ 196 -0.0282 0.00343 -4.15 5.00e-5 1.00 chr3 1.64e7 1.64e7
5 cg271~ 196 -0.0126 0.00998 -4.06 7.14e-5 1.00 chr3 1.34e8 1.34e8
6 cg271~ 196 0.0220 0.00384 4.05 7.25e-5 1.00 chr16 8.94e7 8.94e7
7 cg263~ 196 0.0115 0.00380 4.04 7.66e-5 1.00 chr11 1.02e8 1.02e8
8 cg171~ 196 -0.0157 0.00351 -4.03 7.95e-5 1.00 chr7 9.92e7 9.92e7
9 cg158~ 196 -0.0106 0.00447 -4.02 8.16e-5 1.00 chr14 1.03e8 1.03e8
10 cg068~ 196 0.0129 0.00355 3.97 9.94e-5 1.00 chr5 4.07e7 4.07e7
# ... with 299,211 more rows, and 1 more variable: UCSC_RefGene_Name <chr>
$Rhea
# A tibble: 299,221 x 11
id n beta SE t P.Value adj.P.Val chromosome start end
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int> <int>
1 cg157~ 156 0.0328 0.0136 4.73 4.85e-6 1.00 chr11 6.57e7 6.57e7
2 cg192~ 156 0.0126 0.00334 4.55 1.06e-5 1.00 chr7 1.57e8 1.57e8
3 cg213~ 156 0.0223 0.00939 4.54 1.11e-5 1.00 chr7 4.77e6 4.77e6
4 cg120~ 156 0.0800 0.00385 4.41 1.93e-5 1.00 chr11 6.61e7 6.61e7
5 cg126~ 156 0.0373 0.0110 4.32 2.72e-5 1.00 chr11 6.37e7 6.37e7
6 cg065~ 156 0.0142 0.00454 4.31 2.80e-5 1.00 chr1 2.73e7 2.73e7
7 cg231~ 156 -0.0572 0.00457 -4.30 2.95e-5 1.00 chr6 1.68e8 1.68e8
8 cg254~ 156 -0.0170 0.00357 -4.28 3.19e-5 1.00 chr6 5.63e7 5.63e7
9 cg189~ 156 -0.0201 0.00521 -4.23 3.99e-5 1.00 chr15 4.41e7 4.41e7
10 cg181~ 156 0.0160 0.00396 4.16 5.13e-5 1.00 chr1 1.19e7 1.19e7
# ... with 299,211 more rows, and 1 more variable: UCSC_RefGene_Name <chr>
$INMASAB
# A tibble: 299,221 x 11
id n beta SE t P.Value adj.P.Val chromosome start end
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int> <int>
1 cg052~ 176 0.0189 0.0114 4.70 5.11e-6 1.00 chr6 2.89e7 2.89e7
2 cg028~ 176 0.0125 0.00500 4.35 2.32e-5 1.00 chr10 1.31e8 1.31e8
3 cg228~ 176 -0.0178 0.00893 -4.30 2.81e-5 1.00 chr6 3.19e7 3.19e7
4 cg181~ 176 -0.0134 0.00421 -4.17 4.79e-5 1.00 chr6 1.61e8 1.61e8
5 cg033~ 176 -0.0146 0.0108 -4.12 5.76e-5 1.00 chr16 2.73e7 2.73e7
6 cg036~ 176 -0.0331 0.00368 -4.11 5.97e-5 1.00 chr12 5.31e7 5.31e7
7 cg131~ 176 0.0160 0.00384 3.96 1.09e-4 1.00 chr12 5.00e7 5.00e7
8 cg185~ 176 -0.0107 0.00356 -3.96 1.09e-4 1.00 chr5 1.73e8 1.73e8
9 cg227~ 176 -0.0145 0.00416 -3.91 1.33e-4 1.00 chr15 4.41e7 4.41e7
10 cg062~ 176 0.0132 0.00341 3.89 1.39e-4 1.00 chr17 1.98e7 1.98e7
# ... with 299,211 more rows, and 1 more variable: UCSC_RefGene_Name <chr>
attr(,"class")
[1] "dsLimma" "list"
Then, data can be combined with meta-analysed as follows:
metaP.model.sex.annot <- metaPvalues(meta.model.sex.annot.sva)
metaP.model.sex.annot
# A tibble: 299,221 x 8
id BIB EDEN KANC MoBA Rhea INMASAB p.meta
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cg12052203 0.423 0.0522 0.00000218 0.0257 0.0000193 0.0129 1.72e-10
2 cg22266749 0.571 0.0193 0.00000316 0.0668 0.000117 0.0399 3.67e- 9
3 cg11643285 0.00267 0.909 0.0322 0.0000500 0.000136 0.0345 5.71e- 9
4 cg03691818 0.0115 0.674 0.0291 0.00979 0.0207 0.0000597 3.71e- 7
5 cg07210061 0.351 0.00238 0.358 0.00365 0.0331 0.00414 9.33e- 6
6 cg25698541 0.00102 0.129 0.807 0.00271 0.107 0.0133 2.06e- 5
7 cg15422033 0.317 0.0496 0.00381 0.887 0.000845 0.0108 2.35e- 5
8 cg23108580 0.578 0.00227 0.0000853 0.0495 0.115 0.852 2.57e- 5
9 cg12613618 0.158 0.0988 0.00425 0.472 0.0000272 0.689 2.73e- 5
10 cg24537724 0.165 0.754 0.000380 0.0265 0.0297 0.0171 2.90e- 5
# ... with 299,211 more rows
# Get the number of differentially methylated probes in the meta approach
dim(metaP.model.sex.annot[metaP.model.sex.annot$p.meta<0.05,])
[1] 10417 8
# Get the number of differentially expressed genes in the meta approach
# (multiple-testing correction)
dim(metaP.model.sex.annot[metaP.model.sex.annot$p.meta<0.05/nrow(metaP.model.sex.annot),])
[1] 3 8
As a result, from the initial list of almost 300k CpGs, we identify a list of 10,417 DMPs between boys and girls of the HELIX cohorts, from which only 3 passed the strict bonferroni multiple-testing correction. Of these 3, the cg12052203 and cg25650246 (mapping the B3GNT1 and RFTN1 respectively) have been previously associated with sex methylation differences (http://www.ewascatalog.org).
And we can revisit the qqplot:
qqplot(metaP.model.sex.annot$p.meta)
Once we have obtained the top differentially methylated genes per cohort, we could extract their gene symbols or gene entrez ids directly from each output dataframe and continue with the functional annotation analysis (FEA) in our local session as usual. As presented in the showcase of gene expression analysis, we will conduct an enrichment analysis with KEGG:
# Get list of significant CpGs (checking the number).
sigCpGs <- as.character(metaP.model.sex.annot[which(
metaP.model.sex.annot$p.meta < 0.01),"id"][[1]])
length(sigCpGs)
[1] 1896
# Load required packages for enrichment analysis with CpGs.
require(IlluminaHumanMethylation450kanno.ilmn12.hg19)
require(missMethyl)
# Get whole list of CpGs in the 450K array.
ann450k = getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
all <- ann450k@listData$Name
# Perform enrichment analysis for GO terms.
goenrichmnet_cpgs <- gometh(sig.cpg=sigCpGs, all.cpg=all,
plot.bias=FALSE,array.type="450K",collection="GO")
goenrichmnet_cpgs <- goenrichmnet_cpgs[order(goenrichmnet_cpgs$P.DE),]
head(goenrichmnet_cpgs)
ONTOLOGY
GO:0032968 BP
GO:0044752 BP
GO:0048532 BP
GO:0021675 BP
GO:0007638 BP
GO:0070593 BP
TERM
GO:0032968 positive regulation of transcription elongation from RNA polymerase II promoter
GO:0044752 response to human chorionic gonadotropin
GO:0048532 anatomical structure arrangement
GO:0021675 nerve development
GO:0007638 mechanosensory behavior
GO:0070593 dendrite self-avoidance
N DE P.DE FDR
GO:0032968 13 6 0.0006514909 1
GO:0044752 3 3 0.0018339017 1
GO:0048532 17 8 0.0019671882 1
GO:0021675 79 21 0.0021539566 1
GO:0007638 15 7 0.0024061509 1
GO:0070593 16 7 0.0026827085 1
# Perform enrichment analysis for KEGG terms.
KEGGenrichmnet_cpgs <- gometh(sig.cpg=sigCpGs, all.cpg=all,
plot.bias=FALSE,array.type="450K",
collection="KEGG")
KEGGenrichmnet_cpgs <- KEGGenrichmnet_cpgs[order(KEGGenrichmnet_cpgs$P.DE),]
head(KEGGenrichmnet_cpgs)
Description N DE
path:hsa04961 Endocrine and other factor-regulated calcium reabsorption 53 13
path:hsa05146 Amoebiasis 100 18
path:hsa00561 Glycerolipid metabolism 60 11
path:hsa04975 Fat digestion and absorption 43 7
path:hsa05031 Amphetamine addiction 69 13
path:hsa04974 Protein digestion and absorption 102 16
P.DE FDR
path:hsa04961 0.008665876 1
path:hsa05146 0.015520015 1
path:hsa00561 0.044920205 1
path:hsa04975 0.064348016 1
path:hsa05031 0.070891671 1
path:hsa04974 0.073866805 1
The FEA show that these CpGs map genes participating in processes with noticeable sex differences such is the case of bone formation regulated by endocrine factors (“Endocrine and other factor-regulated calcium reabsorption,” hsa04961).
As always, the DataSHIELD session must be closed by:
datashield.logout(conns)
TODO
TODO
TODO
The PRS function can be divided into three main blocks that perform the required tasks to come up with the results. The three functional blocks are the following:
prs_table table if PGS Catalog is used, perform calls to the server to subset the resources and calculate the PRS with the subsetted resources.prs_table table) to calculate the PRS.In order to introduce a little how those three blocks work internally, schematized flow charts have been designed. To understand the exact inner working of the functionality it is advised to follow the flowcharts alongside the actual source code. Find the links to the three blocks source code, client, server resolver and server analysis.
The client performs two basic tasks. The first one is to select between a user introduced prs_table or to retrieve the prs_table from the PGS Catalog. Please note that prs_table stands for Region Of Interest, which may not be the common term for PGS analysis but it makes sense since it refers to a table that contain the SNP’s of interest and their weight to the score. If the user introduces a custom prs_table and a PGS Catalog ID, only the introduced table will be used, discarding the PGS Catalog ID information. Once this table is assembled, the next step is to call the required functions on the study servers. First, the resource resolver is called, and after that, the function that calculates the PRS is called.
This is illustrated on the following figure.
Figure 15.1: Flow chart of the client block.
The server resolver is in charge of resolving the resources. The interesting aspect is that only a region of interest is actually assigned to the R session, this is to avoid overloading the memory with unnecessary information. There are two different methodologies to perform this subsetting, one is using chromosome names and position and the other is using the SNP id’s. Due to the technical specification of VCF files, is much easier to perform the subsetting using chromosome names and positions because there is an indexing file for the VCF files to perform fast queries of regions by position. On the other hand, to filter a VCF file using SNP id’s, the whole file has to be scanned, yielding a much slower process.
This block is illustrated on the following figure.
Figure 15.2: Flow chart of the server resolver block.
Many processes are performed inside the analysis block. For that reason, more than a flow chart, a step by step guide has been illustrated with the objects that are created (or modified) on each step. The most important step on this block is making sure that the alternate (risk) alleles match between the VCF resources and the alternate alleles stated by the prs_table or PGS Catalog.
The block is illustrated on the following figure.
Figure 15.3: Flow chart of the server analysis block.
When working with genotype data, OmicSHIELD offers the possibility of using VCF and GDS formats. When a VCF is supplied, internally is converted to a GDS to work with it, for that reason it is always better to start with a GDS to improve the performance of the pipeline.
When converting to GDS using gdsfmt, there are many compression options that will affect the final file size. This compression comes with an added cost, which is the read time. For very aggressive compression typically the reading time gets severely affected, so it is good to find a good balance. To help choosing the right compression we provide a comparison table between all the compression options, the table has been extracted from the official documentation of the gdsgmt package.
| Compression Method | Raw | ZIP | ZIP_ra | LZ4 | LZ4_ra | LZMA | LZMA_ra |
|---|---|---|---|---|---|---|---|
| Data Size (MB) | 38.1 | 1.9 | 2.1 | 2.8 | 2.9 | 1.4 | 1.4 |
| Compression Percent | 100% | 5.08% | 5.42% | 7.39% | 7.60% | 3.65% | 3.78% |
| Reading Time (second) | 0.21 | 202.64 | 2.97 | 84.43 | 0.84 | 462.1 | 29.7 |
At ISGlobal we are using the LZ4_ra compression method because it provides a very good compression level with the least effect to the reading time.
We conducted some stress tests to assess the maximum allocated RAM and HDD space consumed by the analysis portraited on this bookdown.
For epigenomics and differential gene expression analysis we portrayed the use of limma + voom. The disk space consumed by the ExpressionSets used is dependent on the array dimensions and the number of individuals, and so is the RAM allocation; this yields the following table:
| Array | Individuals | Storage [GB] | RAM [GB] |
|---|---|---|---|
| EPIC 450K | 100 | 0,34 | 3 |
| EPIC 450K | 500 | 1,7 | 12 |
| EPIC 450K | 1000 | 3,4 | 24 |
| EPIC 450K | 2000 | 6,8 | 48 |
| EPIC 850K | 100 | 0,64 | 8 |
| EPIC 850K | 500 | 3,2 | 36 |
| EPIC 850K | 1000 | 6,4 | 72 |
| EPIC 850K | 2000 | 12,8 | 144 |
This table can be summarized on the following graphs:
Figure 17.1: Disk space requirements for EPIC 450K array versus number of individuals (10 meta-variables)
Figure 17.2: Disk space requirements for EPIC 850K array versus number of individuals (10 meta-variables)
Figure 17.3: Maximum allocated RAM for EPIC 450K array versus number of individuals (1 adjustin covariable)
Figure 17.4: Maximum allocated RAM for EPIC 850K array versus number of individuals (1 adjustin covariable)
We can clearly see a lineal correlation to estimate the hardware requirements.
Genomic data is accepted in OmicSHIELD as gds files, which can be compressed using different algorithms, therefore it makes no sense to extract any relationship between individuals / number of variants and file size. Please refer to the appropiate section for information about that.
Also, the methodologies used (both pooled and meta-analysis) are not as RAM dependent as Limma analysis. This has to do with the nature of genomic data, which is typically extremely large; to analyze this kind of data we do not load it entirely on the RAM, the algorithms perform clever techniques that read portions of the data at a time, and the size of that portion can be configured by the user (see the argument snpBlock), therefore the researchers can tune the function calls if RAM limits become an issue.
On top of this information some details must be noted:
Regarding the discussion of dimensioning the servers for number of cores; this does not depend on the size and complexity of the analysis to be carried out, but some functionalities benefit from a multi-core architecture to speed up the calculations. However, it is not a constraint, as everything will run perfectly on one core. The discussion to take place is about the number of concurrent users of the servers. As a rule of thumb, we recommend having one core per user, that way things will not collapse when all users perform analyses at the same time. However, this has financial consequences. A good starting point would be to work with 8 cores minimum, but this is for the partners to decide.
DataSHIELD and OmicSHIELD have different filters that can be configured by each study server independently. This filters are automatically setted when installing the server packages (dsOmics and dsBase) and take the default values, which are designed to be sensible. This values can be configured individually through the web interface of Opal so that each cohort can establish their disclosure risk. First, we need to understand a little bit what is the function of each filter:
default.nfilter.diffP.epsilon: \(\epsilon\)-privacy level for the differential-privacy (defaults to 3)default.nfilter.diffP.resampleN: Number of resamples for the assessment of the l1-sensitivity of the differential-privacy. Largest is more safe but more slow (defaults to 3)default.nfilter.MAF: Filter for the calculation of the minor allele frequencies (MAF), MAF values lower than this threshold will not be returned (defaults to 0.05)To change this values, we have to login to the Opal web interface of our server, we can visit the Opal demo website as an example. We have to login as an administrator to change the filter settings, for the Opal demo: administrator/password are the credentials.
Once logged in, we navigate to Administration > DataSHIELD.
On this configuration page we have to locate the “Profiles” section (information about them here), there we will find the different analysis profiles of our server, we can set up different filter configurations for each of the profiles. Following the Opal demo example we will be modifying the filters for the “omics” profile, which has the dsOmics package installed. To do so, we select the profile and go to the “Options” tab (illustrated on the figure 18.1). There we can manually edit the filter values at our own risk.
Figure 18.1: DataSHIELD filters on the Opal web interface
sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252
[3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Spain.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] missMethyl_1.24.0
[2] IlluminaHumanMethylationEPICanno.ilm10b4.hg19_0.6.0
[3] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.0
[4] minfi_1.36.0
[5] bumphunter_1.32.0
[6] locfit_1.5-9.4
[7] iterators_1.0.13
[8] foreach_1.5.1
[9] Biostrings_2.58.0
[10] XVector_0.30.0
[11] SummarizedExperiment_1.20.0
[12] MatrixGenerics_1.2.1
[13] matrixStats_0.58.0
[14] clusterProfiler_3.18.1
[15] org.Hs.eg.db_3.12.0
[16] ggnewscale_0.4.5
[17] enrichplot_1.10.2
[18] DOSE_3.16.0
[19] biomaRt_2.46.3
[20] gt_0.3.1
[21] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[22] GenomicFeatures_1.42.3
[23] AnnotationDbi_1.52.0
[24] Biobase_2.50.0
[25] GenomicRanges_1.42.0
[26] GenomeInfoDb_1.26.7
[27] IRanges_2.24.1
[28] S4Vectors_0.28.1
[29] BiocGenerics_0.36.0
[30] ggrepel_0.9.1
[31] patchwork_1.1.1
[32] forcats_0.5.1
[33] stringr_1.4.0
[34] dplyr_1.0.5
[35] purrr_0.3.4
[36] readr_1.4.0
[37] tidyr_1.2.0
[38] tibble_3.1.0
[39] ggplot2_3.3.3
[40] tidyverse_1.3.0
[41] magrittr_2.0.1
[42] dsOmicsClient_1.0.7-13
[43] DSLite_1.2.0
[44] dsBaseClient_6.1.1
[45] DSOpal_1.3.0
[46] opalr_3.0.0
[47] httr_1.4.2
[48] DSI_1.3.0
[49] R6_2.5.0
[50] progress_1.2.2
[51] BiocStyle_2.18.1
loaded via a namespace (and not attached):
[1] rappdirs_0.3.3 rtracklayer_1.49.5
[3] bit64_4.0.5 knitr_1.37
[5] multcomp_1.4-16 DelayedArray_0.16.3
[7] data.table_1.13.0 RCurl_1.98-1.3
[9] GEOquery_2.58.0 generics_0.1.0
[11] metap_1.4 preprocessCore_1.52.1
[13] cowplot_1.1.1 TH.data_1.0-10
[15] RSQLite_2.2.5 shadowtext_0.0.8
[17] bit_4.0.4 mutoss_0.1-12
[19] xml2_1.3.2 lubridate_1.7.10
[21] assertthat_0.2.1 viridis_0.5.1
[23] xfun_0.28 hms_1.0.0
[25] jquerylib_0.1.4 evaluate_0.14
[27] fansi_0.4.2 scrime_1.3.5
[29] dbplyr_2.1.1 readxl_1.3.1
[31] igraph_1.2.6 DBI_1.1.1
[33] tmvnsim_1.0-2 reshape_0.8.8
[35] ellipsis_0.3.1 backports_1.2.1
[37] bookdown_0.24 annotate_1.68.0
[39] sparseMatrixStats_1.2.1 vctrs_0.3.8
[41] cachem_1.0.4 withr_2.4.3
[43] ggforce_0.3.3 emo_0.0.0.9000
[45] GenomicAlignments_1.26.0 prettyunits_1.1.1
[47] mclust_5.4.7 mnormt_2.0.2
[49] crayon_1.4.1 genefilter_1.72.1
[51] pkgconfig_2.0.3 labeling_0.4.2
[53] tweenr_1.0.2 nlme_3.1-152
[55] rlang_0.4.10 lifecycle_1.0.0
[57] sandwich_3.0-0 downloader_0.4
[59] BiocFileCache_1.14.0 mathjaxr_1.4-0
[61] modelr_0.1.8 cellranger_1.1.0
[63] polyclip_1.10-0 rngtools_1.5
[65] base64_2.0 Matrix_1.3-2
[67] Rhdf5lib_1.12.1 zoo_1.8-9
[69] reprex_2.0.0 png_0.1-7
[71] viridisLite_0.3.0 bitops_1.0-6
[73] rhdf5filters_1.2.0 blob_1.2.1
[75] DelayedMatrixStats_1.12.3 doRNG_1.8.2
[77] qvalue_2.22.0 nor1mix_1.3-0
[79] scales_1.1.1 memoise_2.0.0
[81] plyr_1.8.6 zlibbioc_1.36.0
[83] compiler_4.0.4 scatterpie_0.1.6
[85] RColorBrewer_1.1-2 illuminaio_0.32.0
[87] plotrix_3.8-1 Rsamtools_2.6.0
[89] cli_2.4.0 MASS_7.3-53
[91] tidyselect_1.1.0 stringi_1.5.3
[93] highr_0.8 yaml_2.2.1
[95] GOSemSim_2.16.1 askpass_1.1
[97] grid_4.0.4 sass_0.4.0
[99] fastmatch_1.1-0 tools_4.0.4
[101] rstudioapi_0.13 gridExtra_2.3
[103] farver_2.1.0 ggraph_2.0.5
[105] digest_0.6.27 rvcheck_0.1.8
[107] BiocManager_1.30.12 quadprog_1.5-8
[109] Rcpp_1.0.7 siggenes_1.64.0
[111] broom_0.7.6 Rdpack_2.1.1
[113] colorspace_2.0-0 rvest_0.3.6
[115] XML_3.99-0.6 fs_1.5.0
[117] splines_4.0.4 statmod_1.4.35
[119] sn_2.0.0 graphlayouts_0.7.1
[121] multtest_2.46.0 xtable_1.8-4
[123] jsonlite_1.7.2 tidygraph_1.2.0
[125] TFisher_0.2.0 pillar_1.5.1
[127] htmltools_0.5.2 mime_0.10
[129] glue_1.4.2 fastmap_1.1.0
[131] BiocParallel_1.24.1 beanplot_1.2
[133] codetools_0.2-18 fgsea_1.16.0
[135] mvtnorm_1.1-1 utf8_1.2.1
[137] lattice_0.20-41 bslib_0.3.1
[139] numDeriv_2016.8-1.1 BiasedUrn_1.07
[141] curl_4.3 GO.db_3.12.1
[143] openssl_1.4.3 survival_3.2-7
[145] limma_3.46.0 rmarkdown_2.11
[147] munsell_0.5.0 DO.db_2.9
[149] rhdf5_2.34.0 GenomeInfoDbData_1.2.4
[151] HDF5Array_1.18.1 labelled_2.8.0
[153] haven_2.3.1 reshape2_1.4.4
[155] gtable_0.3.0 rbibutils_2.1
This bookdown has had the contributions by the following people.
The same methodology and code can be used with unitary VCF resources that contain the variant information of all chromosomes.↩︎
The same methodology and code can be used with unitary VCF resources that contain the variant information of all chromosomes.↩︎
It’s important to note that all the other encodings present on the case-control column that are not the case or control, will be turned into missings. As an example, if we have "cancer"/"no cancer"/"no information"/"prefer not to answer" we can specifcy case="cancer", control="no cancer" and all the individuals with "no information"/"prefer not to answer" will be interpreted as missings.↩︎
The same methodology and code can be used with unitary VCF resources that contain the variant information of all chromosomes.↩︎